Rollbacks don’t always return you to a previous system state. They can return you to a state you’ve never tested or operated before.
Steve Fenton — Octopus Deploy
This article explains the math of burn rate alerting and gives well-thought-out reasoning for why burn rates are better.
James Frullo — Datadog
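For a rough sense of the math involved, here is a minimal sketch, assuming the common definition of burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The function names and the example numbers are illustrative assumptions, not taken from the article.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    error_ratio: fraction of failed requests in the lookback period (0..1)
    slo_target:  SLO as a fraction, e.g. 0.999 for 99.9%
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, slo_window_hours: float) -> float:
    """Hours until the whole budget is spent if this burn rate holds."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never spent
    return slo_window_hours / rate


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% SLO over 30 days, 1.44% of requests failing.
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"hours until budget is gone: {hours_to_exhaustion(rate, 30 * 24):.0f}")
```

With these made-up inputs the burn rate comes out to roughly 14.4, meaning a 30-day budget would be gone in about 50 hours. An alert keyed to burn rate fires on how fast the budget is being spent, not on the raw error ratio alone.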
This hot take is worth thinking about: what do you want to get out of assigning incident severity levels, and is it working?
Hamed Silatani — Uptime Labs
Less a defense, and more a guide to coping with a code freeze and avoiding the downsides when you’ve got no choice.
Tom Elliott
MTTI in this case is Mean Time to Isolate. How long are you taking to figure out what system component is at the heart of an incident? What does MTTI say about your system, and what can you do about it?
Old School Burke
This article doesn’t answer the question in its title concretely, but it does give one a lot to think about. It also shares some ideas for how to cope with the potential challenges identified.
Sylvain Kalache — LeadDev
This one starts off as a review of a workbook on root cause analysis by the UK Health and Safety Executive. Then it raises concerns about RCA-based reasoning and contrasts it with a different model based on resilience engineering.
Lorin Hochstein
I wrote this article in response to Azure’s post, Introducing Azure SRE Agent. There’s a lot we can learn from the example agent interactions that Microsoft chose to share.
Lex Neva