SRE Weekly Issue #167

Articles

Conference Report: SRECon Americas 2019

This is an awesome write-up of SRECon, but the part I really love is the intro. The author gives voice to a growing tension I’ve seen in our field, as we try to adopt the tenets of Safety II which can seem to be at odds with traditional SRE practices. There’s a lot here that we SREs need to work out as our profession matures, and I’m really enjoying the process.

Tanya Reilly

Control Is an Illusion

Experts recommend trying to keep the concepts of blame, root cause, and hindsight bias out of our retrospective investigations. This insightful article explains that they all stem from the illusion that we are in full control of our systems.

Thanks to Will Gallego for this one.

Ryan Frantz

What We Learned from the Recent Mandrill Outage

Here’s a top-notch followup analysis from Mailchimp on the Mandrill outage last month. Their Postgresql DB ran out of transaction IDs (a common failure mode), causing a painful outage. Tons of great stuff here including a mention of rotating ICs every 3 hours to prevent exhaustion and allow them to sleep.

Mailchimp

Ethiopian crash: Boeing 737 Max pilots followed expected procedures, aviation officials say

And here’s where things get really interesting. Incidents are never as simple as they seem from the outside, and the 737 MAX situation is no exception. I anxiously await the full report, in which we’ll hear more about the confluence of contributing factors that must have been involved here.

Thom Patterson — CNN

The true story behind the deadliest air disaster of all time

There’s a lot in this, and I don’t feel comfortable summarizing it with a little blurb about lessons learned. Chilling though it is, I’m glad I read it.

Thanks to Sri Ray for this one.

Patrick Smith — The Telegraph

Production ready code is much more than error handling – Ayende @ Rahien

I consider a system to production ready when it has, not error handling inside a particular component, but actual dedicated components related to failure handling (note the difference from error handling), management of failures and its mitigations.

Ayende Rahien

Outages

Travis CI
Slack
- And this one.
Google Cloud Platform (us-central1)
Heroku
Instagram
Squarespace
- Click for another A+ followup analysis from Squarespace. Thanks, folks!

SRE Weekly Issue #167

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues