Articles
In this story of SLOs gone bad, error budgets and code freezes provided a perverse incentive that caused a great deal of harm.
dobbse.net
This article seeks to apply SRE principles to security in the form of a Threat Budget.
Jason Bloomberg — Intellyx
After talking to hundreds of engineers about their processes, we’ve identified five of the most common challenges we see across companies looking to put more structure behind how they manage their incidents.
Mike Lacsamana — FireHydrant
The Analysis section has a lot of important lessons. What really stands out in this incident review is that Honeycomb plainly states that they don’t yet know what went wrong, and why they don’t.
Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.
Several small staging clusters, each fit for its purpose, offer a more maintainable, cheaper alternative.
Tyler Cipriana
I’m really enjoying the Admiral Cloudberg series of aircraft accident investigation reports. How did I not know about these before??
A lot has improved in aviation safety since this crash in 1967, but there’s still a lot we can learn in SRE even now. For example: the operator’s view into the system should make the result of their inputs clear.
Admiral Cloudberg
An unannounced (maybe inadvertent?) breaking change in an Azure API caused an outage. Here’s the story of the investigation.
Nikko Campbell — Metrist
Another Admiral Cloudberg air accident investigation, this time showing how easily critical details can slip through the cracks.
Admiral Cloudberg