Articles
This delightful talk explores what SRE can look like in practical terms by learning about the sociotechnical situation at a fictitious company. To do that, Amy Tobey plays a game she created, walking through a town and talking to NPCs.
Amy Tobey — InfoQ
Honeycomb had a major outage last Tuesday, and they posted this interim outage report on their status page.
Note: Honeycomb is my employer, and I proofread this article.
Honeycomb
The system resiliency pyramid provides a holistic framework for thinking about reliability across five key layers.
I like the way this system of layers breaks down the different aspects of reliability.
Code Reliant
This article explores system overload using a traffic congestion analogy. I especially like the note about failover as a cause of an overload condition.
Tanveer Gill — FluxNinja
In this article, I’ll dive into this vital DORA metric, detail its benchmarks, and provide practical insights to help you drive more frequent successful changes.
incident.io
This article explains four different rate limiting algorithms and includes code snippets in Java.
Code Reliant
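For readers who want a feel for what a rate limiter does before clicking through, here is a minimal sketch of a token bucket, one commonly used rate limiting algorithm. It is written in Python rather than the article's Java, it is not taken from the article, and it may or may not be one of the four algorithms covered there. The names `TokenBucket`, `capacity`, `refill_rate`, and `allow` are illustrative assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative token bucket (hypothetical names, not from the linked article)."""

    def __init__(
        self,
        capacity: float,
        refill_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if capacity <= 0 or refill_rate <= 0:
            raise ValueError("capacity and refill_rate must be positive")
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self._clock = clock             # injectable so tests can control time
        self._tokens = capacity         # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if enough are available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would create one bucket per client, for example `TokenBucket(capacity=10, refill_rate=5)`, and reject or delay any request for which `allow()` returns `False`. This sketch is not thread-safe; a production limiter would need a lock or an atomic store.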
PostgreSQL vacuuming can be a total pain — and a serious threat to performance and reliability. This new database engine sounds pretty interesting.
Oriole
Current IaC tools are like plain HTML, says this author, and we should have something like CSS to avoid repeating ourselves.
Nathan Peck
PagerDuty looks back on a decade of weekly chaos experiments and shares advice on starting your own similar program.
Cristina Dias — PagerDuty