SRE Weekly Issue #168

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.


This one’s great for folks that are new to SRE, and it’s also an enlightening read for seasoned SREs. What caught me most was the Definition section, on what it means to be an SRE.

Alice Goldfuss

In this articlization of a conference talk, the author lays out 8 common pitfalls in chaos engineering, with detailed example stories related to them. It goes much deeper than mere chaos engineering into the theory of how to operate complex systems.

Nora Jones

Automation can have unintended effects, and it often fails to have the effect we hope for.

Thanks to Greg Burek for this one.

Courtney Nash

Having recently binge-watched Air Emergency, I felt that SREs can learn many things from the aviation industry.

Anshul Patel

Lessons learned by a software engineer on supporting their code in production.

Kashyap Kondamudi

CASE stands for Context-heavy, Actionable, Symptom-based, and Evaluated. That last one’s really key. The author proposes setting an expiration time for your alerts after which time you should evaluate them to make sure that they still make sense.

Cory Watson
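
The "Evaluated" idea above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the `AlertRule` class, its field names, and the 90-day interval are all assumptions made for the example. Each alert records when it was last evaluated and how long that evaluation stays valid, and a helper lists the alerts whose evaluation has expired.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; the field names are illustrative."""
    name: str
    last_evaluated: date        # when someone last confirmed the alert still makes sense
    review_interval: timedelta  # how long that confirmation stays valid

    def expires_on(self) -> date:
        """Date on which the alert should be evaluated again."""
        return self.last_evaluated + self.review_interval


def due_for_review(rules, today):
    """Return the rules whose evaluation has expired as of `today`, oldest first."""
    expired = [rule for rule in rules if rule.expires_on() <= today]
    return sorted(expired, key=lambda rule: rule.expires_on())


# Example data (made up): two alerts reviewed on different dates.
rules = [
    AlertRule("checkout-error-rate", date(2019, 1, 10), timedelta(days=90)),
    AlertRule("api-latency-p99", date(2019, 3, 1), timedelta(days=90)),
]

for rule in due_for_review(rules, today=date(2019, 4, 14)):
    print(f"{rule.name}: evaluation expired on {rule.expires_on()}")
```

A check like this could run on a schedule and open a ticket for each expired alert, so that the evaluation step does not depend on anyone remembering it.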


  • Heroku: (EU) routing issues for ssl:endpoint applications
    • Heroku posted this followup for an outage on April 2.
  • The Travis CI Blog: Incident review for slow booting Linux builds outage
    • The outage happened March 27-28.
  • Azure VMs — North Central US
    • Since deep-linking to Azure incident summaries doesn’t work and this one is especially interesting, I’ll quote it here:

      Azure Storage team made a configuration change on 9 April 2019 at 21:30 UTC to our back-end infrastructure in North Central US to improve performance and latency consistency for Azure Disks running inside Azure Virtual Machines. This change was designed to be transparent to customers. It was enabled following our normal deployment process, first to our test environment, and lower impact scale units before being rolled out to the North Central US region. However, this region hit bugs which impacted customer VM availability. Due to a bug, VM hosts were able to establish session with the storage scale unit but hit issues when trying to receive/send data from/to storage scale unit. This situation was designed to be handled with fallback to our existing data path, but an additional bug led to failure in the fallback path and resulted in VM reboots.

  • Facebook, Instagram, and WhatsApp

SRE Weekly Issue #167

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.


This is an awesome write-up of SRECon, but the part I really love is the intro. The author gives voice to a growing tension I’ve seen in our field, as we try to adopt the tenets of Safety II which can seem to be at odds with traditional SRE practices. There’s a lot here that we SREs need to work out as our profession matures, and I’m really enjoying the process.

Tanya Reilly

Experts recommend trying to keep the concepts of blame, root cause, and hindsight bias out of our retrospective investigations. This insightful article explains that they all stem from the illusion that we are in full control of our systems.

Thanks to Will Gallego for this one.

Ryan Frantz

Here’s a top-notch followup analysis from Mailchimp on the Mandrill outage last month. Their PostgreSQL DB ran out of transaction IDs (a common failure mode), causing a painful outage. Tons of great stuff here including a mention of rotating ICs every 3 hours to prevent exhaustion and allow them to sleep.
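
For readers unfamiliar with this failure mode, here is a minimal sketch of how transaction ID headroom might be watched. It is not Mailchimp's tooling. The query string is the standard one against `pg_database`, while the helper names and the 50% threshold are assumptions chosen for illustration. The functions take already-fetched rows, so no database connection is needed to try them.

```python
# Standard PostgreSQL query: XID age of each database's oldest unfrozen rows.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) FROM pg_database;"

# Transaction IDs are 32 bits wide and compared modulo 2**32, so roughly
# 2**31 transactions can pass before the oldest unfrozen rows would wrap around.
WRAPAROUND_LIMIT = 2**31


def xid_headroom(xid_age: int) -> float:
    """Fraction of the wraparound budget still unused (1.0 = freshly frozen)."""
    return max(0.0, 1.0 - xid_age / WRAPAROUND_LIMIT)


def databases_at_risk(rows, min_headroom=0.5):
    """Names of databases whose headroom fell below `min_headroom`.

    `rows` is an iterable of (datname, xid_age) tuples, e.g. the result of
    XID_AGE_QUERY. The 0.5 default is a hypothetical alerting threshold.
    """
    return [name for name, xid_age in rows if xid_headroom(xid_age) < min_headroom]


# Made-up sample rows standing in for a real query result.
sample_rows = [("orders", 1_500_000_000), ("analytics", 120_000_000)]
print(databases_at_risk(sample_rows))
```

Vacuuming freezes old rows and lowers the reported age, so a database that keeps appearing in this list is one where vacuum is not keeping up.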


And here’s where things get really interesting. Incidents are never as simple as they seem from the outside, and the 737 MAX situation is no exception. I anxiously await the full report, in which we’ll hear more about the confluence of contributing factors that must have been involved here.

Thom Patterson — CNN

There’s a lot in this, and I don’t feel comfortable summarizing it with a little blurb about lessons learned. Chilling though it is, I’m glad I read it.

Thanks to Sri Ray for this one.

Patrick Smith — The Telegraph

I consider a system to be production ready when it has, not error handling inside a particular component, but actual dedicated components related to failure handling (note the difference from error handling), management of failures and its mitigations.

Ayende Rahien


SRE Weekly Issue #166

SRECon was amazing! The talk line-up was mind-blowing, and it was great to meet many of you there. A big thanks to all the speakers for making this one a conference to remember.

A message from our sponsor, VictorOps:

In case you missed it, check out the recording of the recent VictorOps webinar, How to Make On-Call Suck Less. The webinar covers 5 actionable steps every SRE team can take to improve alerting and make on-call suck less:


One of my favorite moments of SRECon: during their talk, Dorothy Jung and Wenting Wang unveiled this choose-your-own-adventure-style game for practicing your incident response skills. See if you can resolve the incident before your stress level gets too high!

Chie Shu, Dorothy Jung, Joel Salas, Dennis So, Sam Faber-Manning, and Wenting Wang — Yelp

Last week was only the second SRECon I’ve managed to attend. Rather than post raw notes from all the talks I attended, I tried something different: I only wrote down the really big stuff that made me think or blew my mind. I’m hoping that just reading this might give those of you that weren’t able to attend a taste of the conference.

Lex Neva

Inspired by SRECon, John Allspaw posted this Twitter thread on the “Humans Are Better At” / “Machines Are Better At” concept.

Who will argue with “make the computers do the easy/tedious stuff so humans can do the difficult/interesting stuff”? (apparently, I will)

John Allspaw

This article goes into what the pilots of the Lion Air 737 Max 8 (and presumably the Ethiopian Airlines one as well) would have had to do in order to regain control over the aircraft. We’re starting to get hints of the task saturation and alert overload both sets of pilots may have faced as they tried to handle the situation:

The Lion Air crew would have had to accomplish this while dealing with a host of alerts, including differences in other sensor data between the pilot and co-pilot positions that made it unclear what the aircraft’s altitude was.

Thanks to Courtney Eckhardt for this one.

Sean Gallagher — Ars Technica

The day before Lion Air’s 737 Max 8 crash last fall, the exact same plane had a similar failure to the one that may have taken that plane down the next day.

Thanks to Courtney Eckhardt for this one.

Alan Levin and Harry Suhartono — Bloomberg

Calvin is interesting for (at least) two reasons: first, it’s designed to work with an existing database, and second, it manages an impressively fast transaction throughput rate.

Adrian Colyer (summary) — The Morning Paper

Thomson et al. (original paper)

This article draws an interesting parallel between two talks at SRECon last week, about making sure that your monitoring doesn’t itself cause incidents.

Beth Pariseau — TechTarget


SRE Weekly Issue #165

As I write this, I’m headed to New York City for SRECon19 Americas, and I can’t wait! If you’re there, come hit me up for some SRE Weekly swag, made using open source software.

A message from our sponsor, VictorOps:

Reducing MTTA and MTTR takes 5 simple steps. Check out this recent blog series, Reducing MTTA, to find 5 simple steps for improving incident response, lowering MTTA over time and making on-call suck less for DevOps and SRE teams:


As we discover more about the Boeing 737 MAX accidents, this author trolled through the ASRS database looking for related complaints.

Thanks to Greg Burek for this one.

James Fallows — The Atlantic

Learn about ASRS, the Aviation Safety Reporting System. Pilots and other aviation crew can report concerns anonymously, and the results are summarized regularly and reported to the FAA, NTSB, and other organizations.

Thanks to Greg Burek for this one.

Jerry Colen — NASA

I caught wind of a previous Boeing 737 issue from the 90s during a personal conversation this week. There’s an interesting parallel to the current 737 MAX issue, as Boeing blamed pilots for incorrectly responding to a “normal” flight incident for which pilots are routinely trained.

Various — Wikipedia

Dr Justine Jordan gives a personal account of how on-duty napping during extended overnight in-hospital duty hours as a trainee doctor eased her fatigue levels and raised her state of alertness.

Dr. Justine Jordan — Irish Medical Times

Circuit-breakers are great, but the service depends on the clients to be configured correctly. A server-side rate-limiting solution is more robust.

Michael Cartmell — Grab
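
To make the server-side idea concrete, here is a minimal token-bucket sketch. The token bucket is one common rate-limiting technique and is used here as an assumption; it is not necessarily the approach the Grab article describes. The class name, parameters, and the per-client limits are hypothetical.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so the limiter can be tested
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Consume `cost` tokens and return True, or return False if too few remain."""
        now = self._clock()
        refill = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# The server keeps one bucket per client, so a misconfigured client
# can only exhaust its own budget.
buckets = {}


def handle_request(client_id):
    """Return an HTTP-style status: 200 if admitted, 429 if the client is over its limit."""
    if client_id not in buckets:
        buckets[client_id] = TokenBucket(rate=5.0, capacity=10)  # hypothetical limits
    return 200 if buckets[client_id].allow() else 429
```

Because the limit is enforced where requests arrive, it holds even when a client ships without a circuit breaker or with one that is configured badly.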

The concept of an ACL-based authorization system is simple enough, but can be a challenge to maintain at scale.

Michael Leong — LinkedIn

We can tell one thing from the outside: it wasn’t a BGP issue.

Alec Pinkham — AppNeta


SRE Weekly Issue #164

A message from our sponsor, VictorOps:

Start making on-call suck less. Last chance to register for the free VictorOps webinar where you can learn about using automation and improved collaboration to create a better on-call experience:


I previously shared an article about the 737 MAX 8, and I’m truly saddened that another accident has occurred. Learning from accidents like this is incredibly important, and the NTSB is among the best at it. I look forward to seeing what we can take away from this to make air travel even safer.

Farnoush Amiri and Ben Kesslen — NBC

The existence of this anonymous channel for pilots is really interesting to me. It sounds like a great way to learn about near misses, which can be nearly identical to catastrophic accidents. Can we implement this kind of anonymous channel in our organizations too?

Thom Patterson and Aaron Cooper — CNN

“Aviation accidents are rarely the result of a single cause,” Lewis noted. “There are often many small things that lead to a crash, and that’s why these investigations take so long.”

Francesca Paris — NPR

Google and other companies are working on their own private undersea cables.

‘People think that data is in the cloud, but it’s not. It’s in the ocean.’

Adam Satariano — New York Times

For this week, I have an article about on-call and how it’s done at NASA. Many of the conclusions here may not be that surprising to those who have been on-call for any length of time, but I think there is a lot to learn from how NASA makes the system work.

Thai Wood — Resilience Roundup (summary)

Emily S Patterson and David D Woods — Ohio State University (original article)

I hadn’t thought of this before, but I really like this idea:

The facilitator’s role in the meeting is different from the other participants. They do not voice their own ideas, but keep the discussion on track and encourage the group to speak up.

Rachael Byrne — PagerDuty

