This reads like a mini list of war stories from a grizzled veteran reliability engineer… because that’s exactly what it is. Don’t forget to click the link at the bottom for the follow-up post!
- Add Redundancy
- Avoid Risk
- Enforce Procedures
- Defend against Prior Root Causes
- Document Best Practices and Runbooks
- Remove the People Who Cause Accidents
If that doesn’t make you want to read this, I don’t know what will.
Casey Rosenthal — Verica
The graveyard that no one dared tread in was the Terraform code. Once they got CI/CD set up, deploys became much easier — and less scary.
Liz Fong-Jones — Honeycomb
My favorite idea in this article is that the absence of “errors” is not the same thing as safety.
Thai Woods (summary)
Sidney Dekker (original paper)
High availability and resilience are key features of Kubernetes. But what do you do when your Kubernetes cluster starts to become unstable and it looks like your ship is starting to sink?
Tim Little — Kudos
- We had issues with Monzo on 29th July. Here’s what happened, and what we did to fix it.
At this point, we’ve confirmed that something we thought was impossible, had in fact happened.
I know the feeling, folks.
- Heroku Incident #1819 follow-up
  - Heroku’s API service degraded when its external error logging provider suffered an outage.
- Halifax and Lloyds (banks)
- Facebook, Instagram, and WhatsApp
- Google search indexing
- British Airways