Last week, I mistakenly attributed [an article](https://www.paigerduty.com/sre-biggest-problem/) to PagerDuty. It was actually by Paige Cruz, whose clever blog name I didn’t pay nearly close enough attention to! Thanks to the several readers who gently nudged me about my error.
If you’ve been in this business long enough, you’ve almost certainly run into an incident where one of the contributors was an implicit invariant that was violated by a new change.
Easily the majority of incidents I’ve been in.
Lorin Hochstein
This article is about trying to solve for this problem:
a potentially significant number of customers or queries can be affected by an outage and this won’t trigger an SLO violation.
Niall Murphy
In this blog post, a surgeon grapples with the difficulty of building a culture of retrospectives and introspection in his surgical team by running a fascinating retro on himself.
Robert Poston, MD
An argument for buying yourself time to slow down and make decisions carefully, as a way of ultimately speeding up incident resolution.
Shayon Mukherjee
Disasters threatening a business’ ability to operate core functions don’t occur that often (phew!), but we do want to ensure we are prepared to keep our business running if they do. To practice disaster response skills, we run business continuity drills, and you can too with our 10-step plan!
Janna Brummel — WeTransfer
How people think about reliability varies between companies. Which of the four perspectives laid out in this article does your company fit into, if any?
Ross Brodbeck
Honeycomb posted this followup on their April 9 outage, explaining what went wrong and how they’re responding.
Honeycomb
Full disclosure: Honeycomb is my employer.
The author of this article posed a question on r/sre:
What matters most for your success as an SRE?
They share a summary of the answers they got, with their commentary.
Nočnica Mellifera — Checkly