Articles
Well, that cleared things up. (It didn’t, but the debate is interesting.)
Scott M. Fulton III — The New Stack
This article has five tips for great incident communication, along with a section on why this matters.
Luis Gonzalez — incident.io
More than just a list of the ways SREs interface with other teams, this article compares them and weighs the advantages and disadvantages of each.
Amin Astaneh — Certo Modo
Building every system to be strong enough to handle peak load can be very expensive. Can we instead build our systems to take excess load from each other cooperatively?
Lorin Hochstein — Surfing Complexity
Another useful “how we do SRE” post, including an incident report template.
Pavel Pritchin — Dodo Engineering
Here’s an interesting twist on the usual “incident severity 101” article: in a company where “anyone can declare an incident”, how do you make sure severity gets set consistently from one incident to the next?
Mike Lacsamana — FireHydrant
How can we work to improve reliability when folks perceive our efforts to be counter to velocity?
Code Reliant
In a blameless culture without consequences, what’s the incentive for learning to make the system more reliable? This is an incredibly thought-provoking article and I’m still not sure how I feel about it.
Robert Poston MD