Articles
Don’t scale up further than you need to! If you won’t ever see more than 100 RPS, don’t architect for 100,000.
Ayende Rahien
This one covers several common patterns of SRE practice and then offers insight on what to look for as you design your own SRE team.
Rob Cummings — Slalom Build
Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.
Lorin Hochstein
Coordination between nodes in a distributed system can kill performance. What kinds of problems require coordination? The CALM theorem can tell us.
Joseph M. Hellerstein and Peter Alvaro — Communications of the ACM
Here’s another good post-incident analysis document template that you can use as inspiration for your own.
Hannah Culver — Blameless
As your product ages, it transitions from “cool new thing” to “tool everyone uses and expects to Just Work”. Your reliability needs will change accordingly.
Lyon Wong — Blameless
Outages
- PagerDuty
  - 95% of event submissions (your systems telling PagerDuty to trigger an alert) failed for about an hour. They posted some detail about what went wrong.
- Slack
  - Their latest update on this outage contains some detail about what went wrong.
- Telegram
- Microsoft Office 365
- Coles Supermarkets
- Adobe Creative Cloud
- GitHub