SRE Weekly Issue #9

Articles

I spoke too soon in the last issue! GitHub has posted an extremely thorough postmortem that answers any questions one might have had about last week’s outage. I like the standard they’re holding themselves to for timely communication:

One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Just monitoring servers isn’t enough to detect an outage. Sometimes even detailed service monitoring can miss an overall performance degradation that involves multiple services in an infrastructure. In this blog post, PagerDuty suggests also monitoring key business metrics (logins, purchase rate, etc.).
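To make the idea concrete, here is a minimal sketch of a business-metric check. It is not taken from PagerDuty's post: the function name `is_anomalous`, the three-standard-deviation threshold, and the sample login counts are all assumptions chosen for illustration. It flags a metric only when it drops well below its recent baseline.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` standard
    deviations below the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        # Not enough data to estimate a baseline; do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat baseline: any drop at all counts as anomalous.
        return current < mu
    return (mu - current) / sigma > threshold


# Made-up sample data: logins per minute over the last few minutes.
logins_per_minute = [118, 122, 120, 125, 119, 121]

print(is_anomalous(logins_per_minute, 117))  # small dip, within normal noise
print(is_anomalous(logins_per_minute, 40))   # sharp drop, worth paging on
```

A check like this can fire even when every individual server and service looks healthy, which is the gap the post describes. A real deployment would likely also need to account for daily and weekly traffic patterns.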

In this case, “yesterday” is in 2013, but this is an excellent postmortem from Mailgun that can serve as an example for all of us.

A customer’s perspective on a datacenter outage, with emphasis on the need for early, frequent, and thorough communication from service providers.

A nicely detailed outage postmortem, including the gory details of the train of thought the engineers followed on the way to a solution. They hint at an important technique that’s not discussed nearly enough, in my opinion: judicious application of bandaid solutions to resolve the outage and allow engineers to continue their interrupted personal time. It’s not necessary to fix a problem the “right” way in the moment, and carefully-applied bandaids help reduce on-call burnout.

How can we be sure (or at least sort of confident) that distributed systems won’t fail? They can be incredibly complex, and their failures can be even more complex. Caitie McCaffrey gives us this ACM Queue article about methods for formal and informal verification.

Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.

Medium has announced a commitment to publishing postmortems for all outages. I’d love to see more companies making a commitment like this. Thanks to reader Pete Shima for this link.

Outages

Updated: February 7, 2016 — 8:40 pm
A production of Tinker Tinker Tinker, LLC