Thanks to all of you who took the time to share your ideas about choosing incidents to investigate! I got some great answers, and I’m looking forward to pulling them together into an article.
I decided to give this GPT-3 thing a spin. It turns out that it absolutely can assemble a newsletter with links to the week’s top SRE stories, each with a short description. It even includes authors. The authors are even real people. The URLs, though… well, they look real, but they’re mysteriously all 404s, and the articles don’t actually exist. Guess you’re stuck with me for now!
This article takes the idea of “internal customers” to its logical conclusion, treating the platform team as if it were a startup company.
Adam Buggia — Sym
This article uses nifty probability formulas to show that blaming an engineer for an incident may well result in diminished reliability and efficiency.
Here’s a report on the CircleCI security incident at the start of the year. There’s some good stuff in there about not blaming the specific engineer whose device was attacked.
Rob Zuber — CircleCI
A hot take on how not to measure your incident response process.
Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.
eBay’s notification platform team built a fault-tolerant, resilient system by injecting faults at the application level.
Wei Chen — eBay
This one succinctly sums up why I haven’t covered the NOTAM outage much yet.
If a small mistake was sufficient to take down a complex system, then our systems would be crashing all of the time.
Don’t you love when merely running strace fixes the problem?
This air accident seems on its face to be a clear-cut story of negligence. There’s far more to it, and the author goes into detail on why blaming the captain can damage air safety industry-wide.