A special treat awaits you in the Outages section this week: five awesome incident followups!
Articles
This is a study of every high-severity production incident in Microsoft Azure services over a span of six months in which the root cause was a software bug.
Adrian Colyer (summary)
Liu et al., HotOS’19 (original paper)
PagerDuty created this re-enactment of an incident response phone bridge. It’s obviously fairly heavily redacted and paraphrased, but it’s still quite educational. It includes interludes where terms such as Incident Commander are explained.
George Miranda — PagerDuty
Outages
- Google Calendar
- Netflix
- Hulu
- Joyent May 27 2014 outage followup
- In this 2014 outage followup, we learn that a Joyent engineer accidentally rebooted an entire datacenter:
The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the data center.
- Salesforce May 17 outage followup
- Click through to read about the massive Salesforce outage last month. A database edit script contained a bug that ran an UPDATE without its WHERE clause, granting elevated permissions to more users than intended. Salesforce shut down broad chunks of their service to prevent data leakage.
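For anyone unfamiliar with why a dropped WHERE clause is so dangerous: without one, an UPDATE applies to every row in the table. Here's a minimal sketch of that class of bug (the table and column names are invented for illustration, not Salesforce's actual schema):

```python
# Sketch of an UPDATE that silently loses its WHERE clause.
# Uses an in-memory SQLite database; schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "user"), ("bob", "user"), ("carol", "user")],
)

# Intended: elevate only one specific user.
# conn.execute("UPDATE users SET role = 'admin' WHERE name = 'carol'")

# Actual: the WHERE clause is missing, so every row matches.
conn.execute("UPDATE users SET role = 'admin'")

admins = conn.execute(
    "SELECT COUNT(*) FROM users WHERE role = 'admin'"
).fetchone()[0]
print(admins)  # every user was elevated, not just one
```

The fix is mechanical, but the lesson from the followup is about process: scripts that modify production data benefit from a dry-run mode or a row-count sanity check before they commit.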
- Second Life mid-May outage followup
- Linden Lab posted about a network maintenance operation that went horribly wrong, resulting in a total outage.
Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.
April Linden — Linden Lab
- Monzo May 30 outage followup
- Monzo posted this incredibly detailed followup for an outage from several weeks ago. Not only does it give us a lot of insight into their incident response process, it also taught me how UK bank transfers work. Thanks to an anonymous reader for this one.
Nicholas Robinson-Wall — Monzo
- Google Cloud Platform June 2 outage followup
- Along with the blog post I linked to last week, Google also posted this technical followup for their major June 2 outage. I’ve never seen one of their followups even close to this long or detailed, and that’s saying a lot.