- Acknowledge human error as a given and aim to compensate for it
- Conduct blameless post-mortems
- Avoid the “deadly embrace”
- Favor decentralized IT architectures
I’ve passed over quite a few of these “lessons learned” articles, but this one is worth reading.
Anurag Gupta — Shoreline.io
Could us-east-1 go away? What might you do about it? Let’s catastrophize!
I love catastrophizing!
This article evaluates its options with a focus on reliability: both the reliability of each service itself and the tools it offers for building reliable services on top of it.
Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
This one answers the questions: what are failure domains, and how can we structure them to improve reliability?
It’s a great list of questions, and it covers a lot of ground. SREs wear many hats.
I’ve always been curious about how Prometheus and similar time-series DBs compress metric data. Now I know!
Alex Vondrak — Honeycomb
This one has some unconfirmed (but totally plausible!) deeper details about what might have gone wrong in the Facebook outage, sourced from rumors.
There’s a really intriguing discussion in here about why organizations might justify a choice of profit at the expense of safety, and how the deck is stacked.