Articles
I’m not a fan of error budgets. I’ve never seen them implemented particularly well up close, though I know lots of folks who say it works for them.
I’ve started to feel a bit sour on the whole error budget thing, but I couldn’t pin down why. This article nails it.
Will Gallego
Will Gallego is my co-worker, although I came across this article separately.
I’m still hooked on flight accident case studies. In this one, mission fixation and indecision lead to disaster.
Air Safety Institute
If I were setting up a curriculum at a university, I’d make an entire semester-long class on the Challenger disaster, and make it required for any remotely STEM-oriented major.
This awesome article is about getting so used to pushing the limits that you forget you’re even doing it, until disaster strikes.
Foone Turing
A couple of weeks back, I linked to a survey about compensation for on-call. Here’s an analysis of the results and some raw data in case you want to tinker with it.
Chris Evans and Spike Lindsey
Learn how this company does incident management drills. They seem to handle things much like a real incident, including doing a retrospective afterward!
Tim Little — Kudos
Outages
- Hosted Graphite
- Salesforce
  - Salesforce experienced a single-pod outage. Heroku was affected as well.
- eBay
- Southwest Airlines
- Crunchyroll
- Hulu
- Slack
  - And this one too. Both contain brief, high-level descriptions of what went wrong.
- Duo Security
- Gmail