I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.
The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.
Paul Osman — Under Armour (Blameless Summit)
Routine infrastructure maintenance had unintended consequences, saturating MySQL with excessive connections.
Daniel Messer — RedHat
This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.
Jason Hayes — Mackinac Center for Public Policy
This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).
This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.
Sandhya Ramu and Vasanth Rajamani — LinkedIn
This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.
Keith Ballinger — GitHub
In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.
Martin Holman — Honeycomb