This one advocates for looking beyond “root cause” when analyzing an incident, and instead finding Themes and Takeaways.
If it can be solved with a pull request it’s not a takeaway.
Vanessa Huerta Granda — Jeli
In this juicy incident, the Incident Commander’s intimate knowledge of a similar failure mode led the response to fixate on it, steering attention away from the true cause.
Fred Hebert — Honeycomb
[…] the more we normalize lower-impact incidents, the more confidence and experience we build for Sev1 situations.
Dan Condomitti — The New Stack
Want to compensate folks extra for on-call work? This tool connects to PagerDuty to do all the heavy lifting for you.
Lawrence Jones — incident.io
This Reddit post in r/sre has some really great stories in the comments.
various users — Reddit
Along with the “why”, this article also goes into the “how”.
Martha Lambert — incident.io
Early in my career, I had to write a raw IP packet generator to reproduce a DoS attack so that I could mitigate it. It’s fun!
In an incident in July, a cloud provider change broke provisioning for new Codespaces VMs, taking down the service.
Jakub Oleksy — GitHub
Put Safety First and Minimize the 12 Common Causes of Mistakes in the Aviation Workplace
FAA (US Federal Aviation Administration)