General

SRE Weekly Issue #167

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

This is an awesome write-up of SRECon, but the part I really love is the intro. The author gives voice to a growing tension I’ve seen in our field as we try to adopt the tenets of Safety-II, which can seem to be at odds with traditional SRE practices. There’s a lot here that we SREs need to work out as our profession matures, and I’m really enjoying the process.

Tanya Reilly

Experts recommend trying to keep the concepts of blame, root cause, and hindsight bias out of our retrospective investigations. This insightful article explains that they all stem from the illusion that we are in full control of our systems.

Thanks to Will Gallego for this one.

Ryan Frantz

Here’s a top-notch follow-up analysis from Mailchimp on the Mandrill outage last month. Their PostgreSQL DB ran out of transaction IDs (a common failure mode), causing a painful outage. Tons of great stuff here, including a mention of rotating ICs every 3 hours to prevent exhaustion and allow them to sleep.

Mailchimp
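
For anyone who hasn’t hit this failure mode before: PostgreSQL’s transaction IDs come from a finite 32-bit counter, and if vacuuming can’t freeze old rows fast enough, the database eventually stops accepting writes to avoid wraparound. Here’s a minimal monitoring sketch — my own illustration, not Mailchimp’s tooling; the connection string and threshold are placeholders:

    # Sketch: warn when any database's oldest unfrozen transaction ID gets too old.
    # Illustration only -- not Mailchimp's tooling. Requires psycopg2; the DSN and
    # threshold below are placeholder values.
    import psycopg2

    WARN_THRESHOLD = 1_000_000_000  # well under the ~2.1 billion hard limit

    def check_txid_age(dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC"
            )
            for datname, xid_age in cur.fetchall():
                if xid_age > WARN_THRESHOLD:
                    print(f"WARNING: {datname} txid age is {xid_age}; check autovacuum")

    if __name__ == "__main__":
        check_txid_age("dbname=postgres")  # hypothetical connection string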

And here’s where things get really interesting. Incidents are never as simple as they seem from the outside, and the 737 MAX situation is no exception. I anxiously await the full report, in which we’ll hear more about the confluence of contributing factors that must have been involved here.

Thom Patterson — CNN

There’s a lot in this, and I don’t feel comfortable summarizing it with a little blurb about lessons learned. Chilling though it is, I’m glad I read it.

Thanks to Sri Ray for this one.

Patrick Smith — The Telegraph

I consider a system to be production ready when it has, not just error handling inside a particular component, but actual dedicated components related to failure handling (note the difference from error handling), management of failures, and their mitigation.

Ayende Rahien
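
To make the error-handling vs. failure-handling distinction a bit more concrete, here’s a rough sketch of what a dedicated failure-management component might look like. This is my own invented illustration, not code from Ayende’s post:

    # Invented illustration: errors are still caught locally, but failures are
    # reported to a dedicated component that owns retry policy and visibility.
    import time

    class FailureManager:
        """Dedicated component: records failures and applies mitigation policy."""

        def __init__(self, max_retries=3):
            self.max_retries = max_retries
            self.failures = []  # in real life: persisted, alerted on, dashboarded

        def run_with_mitigation(self, operation):
            for attempt in range(1, self.max_retries + 1):
                try:
                    return operation()
                except Exception as error:               # local error handling...
                    self.failures.append((operation.__name__, repr(error)))
                    time.sleep(0.1 * 2 ** attempt)       # ...centralized mitigation
            raise RuntimeError(
                f"{operation.__name__} failed after {self.max_retries} attempts"
            )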

Outages

SRE Weekly Issue #166

SRECon was amazing! The talk line-up was mind-blowing, and it was great to meet many of you there. A big thanks to all the speakers for making this one a conference to remember.

A message from our sponsor, VictorOps:

In case you missed it, check out the recording of the recent VictorOps webinar, How to Make On-Call Suck Less. The webinar covers 5 actionable steps every SRE team can take to improve alerting and make on-call suck less:

http://try.victorops.com/sreweekly/making-on-call-suck-less

Articles

One of my favorite moments of SRECon: during their talk, Dorothy Jung and Wenting Wang unveiled this choose-your-own-adventure-style game for practicing your incident response skills. See if you can resolve the incident before your stress level gets too high!

Chie Shu, Dorothy Jung, Joel Salas, Dennis So, Sam Faber-Manning, and Wenting Wang — Yelp

Last week was only the second SRECon I’ve managed to attend. Rather than post raw notes from all the talks I attended, I tried something different: I only wrote down the really big stuff that made me think or blew my mind. I’m hoping that just reading this might give those of you that weren’t able to attend a taste of the conference.

Lex Neva

Inspired by SRECon, John Allspaw posted this Twitter thread on the “Humans Are Better At” / “Machines Are Better At” concept.

Who will argue with “make the computers do the easy/tedious stuff so humans can do the difficult/interesting stuff”? (apparently, I will)

John Allspaw

This article goes into what the pilots of the Lion Air 737 Max 8 (and presumably the Ethiopian Airlines one as well) would have had to do in order to regain control over the aircraft. We’re starting to get hints of the task saturation and alert overload both sets of pilots may have faced as they tried to handle the situation:

The Lion Air crew would have had to accomplish this while dealing with a host of alerts, including differences in other sensor data between the pilot and co-pilot positions that made it unclear what the aircraft’s altitude was.

Thanks to Courtney Eckhardt for this one.

Sean Gallagher — Ars Technica

The day before Lion Air’s 737 Max 8 crash last fall, the very same aircraft experienced a failure similar to the one that may have brought it down.

Thanks to Courtney Eckhardt for this one.

Alan Levin and Harry Suhartono — Bloomberg

Calvin is interesting for (at least) two reasons: first, it’s designed to work with an existing database, and second, it manages an impressively fast transaction throughput rate.

Adrian Colyer (summary) — The Morning Paper

Thomson et al. (original paper)

This article draws an interesting parallel between two talks at SRECon last week, about making sure that your monitoring doesn’t itself cause incidents.

Beth Pariseau — TechTarget

Outages

SRE Weekly Issue #165

As I write this, I’m headed to New York City for SRECon19 Americas, and I can’t wait!  If you’re there, come hit me up for some SRE Weekly swag, made using open source software.

A message from our sponsor, VictorOps:

Reducing MTTA and MTTR takes 5 simple steps. Check out this recent blog series, Reducing MTTA, to find 5 simple steps for improving incident response, lowering MTTA over time and making on-call suck less for DevOps and SRE teams:

http://try.victorops.com/sreweekly/reducing-mtta-alerts

Articles

As we discover more about the Boeing 737 MAX accidents, this author trawled through the ASRS database looking for related complaints.

Thanks to Greg Burek for this one.

James Fallows — The Atlantic

Learn about ASRS, the Aviation Safety Reporting System. Pilots and other aviation crew can report concerns anonymously, and the results are summarized regularly and reported to the FAA, NTSB, and other organizations.

Thanks to Greg Burek for this one.

Jerry Colen — NASA

I caught wind of a previous Boeing 737 issue from the 90s during a personal conversation this week. There’s an interesting parallel to the current 737 MAX issue, as Boeing blamed pilots for incorrectly responding to a “normal” flight incident for which pilots are routinely trained.

Various — Wikipedia

Dr. Justine Jordan gives a personal account of how on-duty napping during extended overnight in-hospital duty hours as a trainee doctor eased her fatigue levels and raised her state of alertness.

Dr. Justine Jordan — Irish Medical Times

Circuit breakers are great, but with them the service depends on its clients being configured correctly. A server-side rate-limiting solution is more robust.

Michael Cartmell — Grab
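
To illustrate the server-side half of that argument, here’s a minimal token-bucket throttle sketch. It’s a generic illustration, not the mechanism described in the Grab article, and the rate and burst numbers are made up:

    # Minimal server-side token-bucket rate limiter (illustrative only).
    import threading
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec, capacity):
            self.rate = rate_per_sec        # tokens added per second
            self.capacity = capacity        # maximum burst size
            self.tokens = float(capacity)
            self.last = time.monotonic()
            self.lock = threading.Lock()

        def allow(self):
            """Return True if the request may proceed, False if it should be shed."""
            with self.lock:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity, self.tokens + (now - self.last) * self.rate
                )
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

    # The server enforces the limit no matter how clients are configured.
    limiter = TokenBucket(rate_per_sec=100, capacity=200)

    def handle_request(request):
        if not limiter.allow():
            return 429  # Too Many Requests
        return 200      # handle normally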

The concept of an ACL-based authorization system is simple enough, but can be a challenge to maintain at scale.

Michael Leong — LinkedIn
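
The “simple enough” part really is simple — an ACL check is just a lookup, as in this toy sketch (all the services and resources here are hypothetical). The hard part is keeping thousands of entries like these correct as systems change:

    # Toy ACL check: the lookup is trivial; maintaining the table at scale isn't.
    # All names here are hypothetical.
    ACLS = {
        "payments-db": {"read": {"billing-svc", "reporting-svc"}, "write": {"billing-svc"}},
        "user-profiles": {"read": {"web-frontend"}, "write": {"profile-svc"}},
    }

    def is_allowed(principal, resource, action):
        return principal in ACLS.get(resource, {}).get(action, set())

    assert is_allowed("billing-svc", "payments-db", "write")
    assert not is_allowed("reporting-svc", "payments-db", "write")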

We can tell one thing from the outside: it wasn’t a BGP issue.

Alec Pinkham — AppNeta

Outages

SRE Weekly Issue #164

A message from our sponsor, VictorOps:

Start making on-call suck less. Last chance to register for the free VictorOps webinar where you can learn about using automation and improved collaboration to create a better on-call experience:

http://try.victorops.com/sreweekly/how-to-make-on-call-suck-less

Articles

I previously shared an article about the 737 MAX 8, and I’m truly saddened that another accident has occurred. Learning from accidents like this is incredibly important, and the NTSB is among the best at it. I look forward to seeing what we can take away from this to make air travel even safer.

Farnoush Amiri and Ben Kesslen — NBC

The existence of this anonymous channel for pilots is really interesting to me. It sounds like a great way to learn about near misses, which can be nearly identical to catastrophic accidents. Can we implement this kind of anonymous channel in our organizations too?

Thom Patterson and Aaron Cooper — CNN

“Aviation accidents are rarely the result of a single cause,” Lewis noted. “There are often many small things that lead to a crash, and that’s why these investigations take so long.”

Francesca Paris — NPR

Google and other companies are working on their own private undersea cables.

‘People think that data is in the cloud, but it’s not. It’s in the ocean.’

Adam Satariano — New York Times

For this week, I have an article about on-call and how it’s done at NASA. Many of the conclusions here may not be that surprising to those who have been on-call for any length of time, but I think there is a lot to learn from how NASA makes the system work.

Thai Wood — Resilience Roundup (summary)

Emily S Patterson and David D Woods — Ohio State University (original article)

I hadn’t thought of this before, but I really like this idea:

The facilitator’s role in the meeting is different from the other participants. They do not voice their own ideas, but keep the discussion on track and encourage the group to speak up.

Rachael Byrne — PagerDuty

Outages

SRE Weekly Issue #163

A message from our sponsor, VictorOps:

Being on-call sucks. To make it better, sign up for the free webinar, “How to Make On-Call Suck Less”, to learn 5 simple steps you can take to improve the on-call experience and become a more efficient SRE team:

http://try.victorops.com/sreweekly/how-to-make-on-call-suck-less

Articles

Using an NTSB report on an airplane crash as a case study, this video presents three common traps we fall into in incident retrospectives:

  1. Counterfactual reasoning
  2. Normative language
  3. Mechanistic reasoning

I want to make this required material for all retrospective participants.

Dr. Johan Bergström — Lund University

Peak-shifting can save you and your customers money and make load easier to handle.

Lara PuReum Yim, Prashant Kumar, Raghav Garg, Preeti Kotamarthi, Ajmal Afif, Calvin Ng Tjioe, and Renrong Weng — Grab

These folks structured their on-call and incident response process around wombats (and sound guidelines and playbooks).

Wes Mason — npm

Lots of great stuff in this case study on an incident involving Chef and Apache. My favorite:

Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals

Ryn Daniels — HashiCorp

Here’s how and why Grab rebuilt their logging pipeline around structured JSON events.

Aditya Praharaj — Grab
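
If you haven’t worked with structured logging before, the gist is that every log line becomes a machine-parseable event rather than free text. Here’s a generic standard-library sketch of the idea — not Grab’s pipeline, and the logger name and fields are made up:

    # Generic structured-JSON logging sketch using only the standard library.
    import json
    import logging

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            event = {
                "ts": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
            # Extra fields passed via `extra=` become first-class keys.
            event.update(getattr(record, "fields", {}))
            return json.dumps(event)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("orders")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("payment processed", extra={"fields": {"order_id": "o-123", "latency_ms": 42}})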

Don Miguel Ruiz’s Four Agreements as applied to incident response:

  1. Be Impeccable With Your Word
  2. Don’t Take Anything Personally
  3. Don’t Make Assumptions
  4. Always Do Your Best

Matt Stratton — PagerDuty

Outages

A production of Tinker Tinker Tinker, LLC