Ever had a Sev 1 non-impacting incident? This team’s Consul cluster was balanced on a razor’s edge: one false move and quorum would be lost. Read about their incident response and learn how they avoided customer impact.
Devin Sylva — GitLab
This SRECon EMEA highlight reel is giving me serious FOMO.
Will Sewell — Pusher
This week we’re taking a look at how teams in high-consequence domains perform handoffs between shifts.
Emily Patterson, Emilie Roth, David Woods, and Renee Chow (original paper)
Thai Wood (summary)
This is an interesting essay on handling errors in complex systems.
In other words, the trick to scaling in the presence of errors is building software around the notion of recovery. Automated recovery.
To be clear: this is about assisting incident responders in gaining an understanding of an incident in the moment, not about finding a “root cause” to present in an after-action report.
I’m not going to pretend to understand the math, but the concept is intriguing.
Nikolay Pavlovich Laptev, Fred Lin, Keyur Muzumdar, Mihai-Valentin Curelea, Seunghak Lee, and Sriram Sankar — Facebook
This one’s about assisting humans in debugging, when they have a reproduction case for a bug but can’t see what’s actually going wrong.
That’s two different uses of “root cause” this week, and neither one is the troublesome variety that John Allspaw has debunked repeatedly.
Zhang et al. (original paper)
Adrian Colyer (summary)
- Here’s an unroll of an interesting Twitter thread by Honeycomb’s Liz Fong-Jones during and after the incident.
- Amazon Prime Video
- Google Compute Engine
- Network administration functions were impacted. Click for their post-incident analysis.
On Wednesday November 6th, many Squarespace websites were unavailable for 102 minutes between 14:13 and 15:55 ET.
Click through for their post-incident analysis.