SRE Weekly Issue #9

Articles

I spoke too soon in the last issue! GitHub has posted an extremely thorough postmortem that answers any questions one might have had about last week’s outage. I like the standard they’re holding themselves to for timely communication:

One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Monitoring Business Metrics

Just monitoring servers isn’t enough to detect an outage. Sometimes even detailed service monitoring can miss an overall performance degradation that involves multiple services in an infrastructure. In this blog post, PagerDuty suggests also monitoring key business metrics (logins, purchase rate, etc.).
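
To make the idea concrete, here is a minimal sketch of a business-metric check, not taken from the PagerDuty post. It assumes you already collect a per-minute count such as logins; the function name `business_metric_alert`, the `min_ratio` threshold, and the sample numbers are all hypothetical.

```python
from statistics import mean


def business_metric_alert(history, current, min_ratio=0.5):
    """Return True if `current` has dropped below `min_ratio` of the baseline.

    `history` is a list of recent per-minute counts (for example, logins
    per minute) and `current` is the latest count. Both the metric and the
    50% threshold are illustrative assumptions.
    """
    if not history:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = mean(history)
    if baseline == 0:
        return False  # a zero baseline cannot drop any further
    return current < baseline * min_ratio


# Hypothetical sample data: logins per minute over the last five minutes.
logins_per_minute = [120, 115, 130, 125, 118]
print(business_metric_alert(logins_per_minute, 40))   # True
print(business_metric_alert(logins_per_minute, 110))  # False
```

A check like this can fire even when every individual server and service looks healthy, which is the gap the article is pointing at. A real setup would likely also account for daily and weekly traffic patterns rather than a flat average.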

What happened yesterday and what we are doing about it

In this case, “yesterday” was in 2013, but this is an excellent postmortem from Mailgun that can serve as an example for all of us.

Handling an Outage

A customer’s perspective on a datacenter outage, with emphasis on the need for early, frequent, and thorough communication from service providers.

Production Postmortem: the Razor Suicide

A nicely detailed outage postmortem, including the gory details of the train of thought the engineers followed on the way to a solution. They hint at an important technique that’s not discussed nearly enough, in my opinion: judicious application of bandaid solutions to resolve the outage and let engineers get back to their interrupted personal time. It’s not necessary to fix a problem the “right” way in the moment, and carefully applied bandaids help reduce on-call burnout.

The Verification of a Distributed System

How can we be sure (or at least sort of confident) that distributed systems won’t fail? They can be incredibly complex, and their failures can be even more complex. Caitie McCaffrey gives us this ACM Queue article about methods for formal and informal verification.

Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.

Public Accountability — Postmortems — Medium

Medium has announced a commitment to publishing postmortems for all outages. I’d love to see more companies making a commitment like this. Thanks to reader Pete Shima for this link.

Outages

Healthplanfinder (WA, US)
- The system went down right before the deadline for users to enroll in plans.
Grindr
- The tweets during this outage were hilarious.
Shaw (ISP)
PlayStation Network
- The third outage this year for PSN.
Virgin Australia
- Another airline grounded by an outage.
British Telecom (UK ISP)
Amazon.com
Google App Engine
Delta
- What is it with airlines lately?
Google Compute Engine
- This one has a nice postmortem.
IRS E-File (US tax system)
