Articles
This is an engrossing write-up of the Chernobyl incident from the perspective of complex systems and failure analysis.
Barry O’Reilly
Slack’s Disasterpiece Theater isn’t quite chaos engineering, but it’s arguably better in some ways. The team carefully crafts scenarios to test their system’s resiliency, verifying (or disproving!) their hypothesis that the system will handle a given disruption without an incident. The post shares three riveting stories of lessons learned from past exercises.
The process each Disasterpiece Theater exercise follows is designed to maximize learning while minimizing risk of a production incident.
Richard Crowley — Slack
The above is the title of this YouTube playlist curated by John Allspaw.
My favorite sentence:
If you think an incident is “too common” to get its own postmortem that’s a good indicator that there’s a deeper issue that we need to address, and an excellent opportunity to apply our postmortem process to it.
Fran Garcia — HostedGraphite
In this post, we’ll share the algorithms and infrastructure that we developed to build a real-time, scalable anomaly detection system for Pinterest’s key operational timeseries metrics. Read on to hear about our learnings, lessons, and plans for the future.
I sure do love a good debugging story.
Eve Harris — Ably
When an incident occurs, your company is faced with a choice: do you seek to learn as much as possible about how it happened, or do you seek to find out who messed up?
Phillip Dowland — Safety Differently