It’s hard to summarize all the awesome here, but it boils down to empathy, collaboration, and asking, “How can I help?” These pay dividends all over an organization, especially in reliability.
Note: Will Gallego is my coworker, although I came across this post on my own.
This follow-up post for a Google Groups outage was (fittingly) hidden away in a Google Group.
Thanks to Jonathan Rudenberg for this one.
Now I can link directly to specific incidents! I miss the graphs, though.
Jamie Hannaford — GitHub
I laughed so hard I scared my cats:
COWORKER: we need to find the root cause asap
ME: *takes long drag* the root cause is that our processes are not robust enough to prevent a person from making this mistake
COWORKER: amy please not right now
Great discussion in the thread!
In Air Traffic Control parlance, if a pilot or controller can’t comply with a request, they should state that they are “unable”. It can be difficult to decide in the moment what one is truly “unable” to do. There are a lot of great lessons here that apply equally well to IT incident response.
Tarrance Kramer — AVweb
The common theme at KubeCon is that SRE teams at many companies produce reliable, reusable patterns for their developers to build with.
Beth Pariseau — TechTarget
This is the story of a tenacious fight to find out what went wrong during an incident. If you read nothing else, the Conclusion section has a lot of great tidbits.
Tony Meehan — Endgame
Here’s a new guide on how to apply Restorative Just Culture. This made me laugh:
They also fail to address the systemic issues that gave rise to the harms caused, since they reduce an incident to an individual who needs to be ‘just cultured’.
Sidney Dekker — Safety Differently