General

SRE Weekly Issue #150

lex

December 2, 2018

Articles

5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

OVMC, EORH Hope To Have Emergency Rooms Back Online

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook

Linda Comins — The Intelligencer for this one.

SREcon EMEA 2018 conference notes

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Developers On Call

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing more mentions of recently.

John Barton

AWS Says It’s Never Seen a Whole Data Center Go Down

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Confusion over medicine names threatens lives

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

SRE Weekly Issue #149

lex

November 25, 2018

General

Comments

View on sreweekly.com

Articles

Cloud Functions pro tips: Using retries to build reliable serverless systems | Google Cloud Blog

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

Emotet malware runs on a dual infrastructure to avoid downtime and takedowns

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

DBMS Musings: Replication and the latency-consistency tradeoff

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi

Capacity planning for Etsy’s web and API clusters

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy

Orchestrating Chaos using Grab’s Experimentation Platform

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, Abeesh Thomas — Grab

Predictive test selection to ensure reliable code changes

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook

Progressive Service Architecture At Auth0

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages. I hope you all had an uneventful Friday.

J. Crew
Lowe’s
Netatmo (smart thermostats)
John Lewis
AWS in Seoul, South Korea
- The outage took down multiple AWS customers including banks and a cryptocurrency exchange.
Walmart
Makro
Facebook
LastPass
Microsoft Azure
- Linked is a detailed followup post describing three distinct “root” causes.

SRE Weekly Issue #148

lex

November 18, 2018

General

Comments

View on sreweekly.com

Articles

Open-Sourcing Our Incident Response Training

Last year, PagerDuty shared their inident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Validating performance and reliability of the new Dropbox search engine

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

Overload control for scaling WeChat microservices\

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)

The Time Our Provider Screwed Us

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

How Honeycomb Has Changed the Way Travis CI Operates Their Business

The section titled “A surprising discovery” is really thought-provoking:

t turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI

Analyzing the GitHub outage

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

Some notes on running new software in production

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

Google Cloud Platform (and possibly CloudFlare)
- The big outage this week occurred when an ISP in Africa accidentally advertised one of Google’s IP blocks over BGP, effectively black-holing traffic originally destined for GCP. This article suggests that CloudFlare might also have been affected, and it includes a statement from the offending ISP’s CEO.
Microsoft Outlook
Instagram
Basecamp 3
Second Life
Heroku followup report: Incident #1655 (October 30)
Facebook

SRE Weekly Issue #147

lex

November 11, 2018

General

Comments

View on sreweekly.com

Articles

Honeycomb’s Charity Majors: Go Ahead, Test in Production

This is an excellent summary of a talk on testing in production last month.

“Distributed systems are incredibly hostile to being cloned or imitated, or monitored or staged,” she said. “Trying to mirror your staging environment to production is a fool’s errand. Just give up.”

Joab Jackson — The New Stack

DBMS Musings: Distributed consistency at scale: Spanner vs. Calvin

The Pros and Cons of Calvin and Spanner, two data-store papers published in 2012. According to the author, Calvin largely stands out as the favorite.

Daniel Abadi

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

What a cool concept!

RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy.

Adrian Colyer — The Morning Paper (summary)

Berger et al. (original paper)

Cross shard transactions at 10 million requests per second

With thousands(!) of MySQL shards, Dropbox needed a way to have transactions span multiple shards while maintaining consistency.

Daniel Tahara — Dropbox

Heatmaps Make Ops Better

This is an excellent introduction to heatmaps with some hints on how to interpret a couple common patterns.

Danyel Fisher — Honeycomb

How Automatic Root Cause Analysis Works

This is a neat idea. By modelling the relationships between the components in your infrastructure, you can figure out which one might be to blame when everything starts alerting at once. Note: this article is heavily geared toward Instana.

Steve Waterworth — Instana

Getafix: How Facebook tools learn to fix bugs automatically

Automated bug fixing seems to be all the rage lately. I wonder, is it practical for companies that aren’t the size of Facebook or Google?

Johannes Bader, Satish Chandra, Eric Lippert, and Andrew Scott — Facebook

Outages

Slack in Europe
Netflix
Instagram
Microsoft’s Windows license activation service
- Microsoft has acknowledged a problem affecting its Windows license activation servers in multiple countries that has resulted in users being told their Windows 10 Pro and Enterprise installations are invalid.
Lloyds Bank
GPS ankle bracelets in Australia
- A violent parolee is on the run after part of the GPS tracking system broke down due to Telstra’s network issues on Friday and Saturday.

SRE Weekly Issue #146

lex

November 4, 2018

General

Comments

View on sreweekly.com

Articles

NRE Labs

NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.

Applying HumanOps To On-Call

Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.

David Mytton — StackPath

October 21 post-incident analysis

GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.

Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.

EC2 Packets per Second: Guaranteed Throughput vs Best Effort

Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.

Matthew Barlocker — Blue Matador

The Day Our DNS Hit an Undocumented Limit in AWS

This sounds an awful lot like those packet rate limits from the previous article…

Chris McFadden — SparkPost

The Woolworths Experiment

Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworth’s tried a similar experiment with their stores, with interesting results.

Sidney Dekker — Safety Differently

Deadlines, lies and videotape: The tale of a gRPC bug

Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it falls to obey timeouts.

Ciaran Gaffney and Fran Garcia — Hosted Graphite

Chaos Engineering Case Study : HomeAway

when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…

Russ Miles — ChaosIQ

SRE Weekly Issue #150

Articles

Outages

SRE Weekly Issue #149

Articles

Outages

SRE Weekly Issue #148

Articles

Outages

SRE Weekly Issue #147

Articles

Outages

SRE Weekly Issue #146

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues