SRE Weekly Issue #167

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

This is an awesome write-up of SREcon, but the part I really love is the intro. The author gives voice to a growing tension I’ve seen in our field as we try to adopt the tenets of Safety-II, which can seem to be at odds with traditional SRE practices. There’s a lot here that we SREs need to work out as our profession matures, and I’m really enjoying the process.

Tanya Reilly

Experts recommend trying to keep the concepts of blame, root cause, and hindsight bias out of our retrospective investigations. This insightful article explains that they all stem from the illusion that we are in full control of our systems.

Thanks to Will Gallego for this one.

Ryan Frantz

Here’s a top-notch follow-up analysis from Mailchimp on the Mandrill outage last month. Their PostgreSQL database ran out of transaction IDs (a common failure mode), causing a painful outage. There’s tons of great stuff here, including a mention of rotating incident commanders (ICs) every 3 hours to prevent exhaustion and allow them to sleep.

Mailchimp

And here’s where things get really interesting. Incidents are never as simple as they seem from the outside, and the 737 MAX situation is no exception. I anxiously await the full report, in which we’ll hear more about the confluence of contributing factors that must have been involved here.

Thom Patterson — CNN

There’s a lot in this, and I don’t feel comfortable summarizing it with a little blurb about lessons learned. Chilling though it is, I’m glad I read it.

Thanks to Sri Ray for this one.

Patrick Smith — The Telegraph

I consider a system to be production ready when it has, not error handling inside a particular component, but actual dedicated components related to failure handling (note the difference from error handling), management of failures and its mitigations.

Ayende Rahien

Outages

Updated: April 7, 2019 — 8:46 pm
A production of Tinker Tinker Tinker, LLC