The point of this thread is to bring attention to the notion that our reactions to surprising events are the fuel that effectively dictates what we learn from them.
John Allspaw — Adaptive Capacity Labs
This article attempts to classify the causes of major outages at the big three cloud providers (AWS, Azure, and GCP).
It was, wasn’t it? Here’s a nice summary of the recent spate of unrelated major incidents.
Zack Whittaker — TechCrunch
Calculating CIRT (Critical Incident Response Time) involves excluding certain types of incidents to arrive at a number that better represents an operations team’s performance.
Julie Gunderson, Justin Kearns, and Ophir Ronen — PagerDuty
There is so much great detail in this followup article about Cloudflare’s global outage earlier this month. Thanks, folks!
John Graham-Cumming — Cloudflare
- Nordstrom’s site went down at the start of a major sale.
- Honeycomb had an 8-minute outage preceded by 4 minutes of degradation. Click through to find out how their CI pipeline surprised them and what they did about it.
- Australian Tax Office
- Stripe: “[…] two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.” Click through for Stripe’s full analysis.