SRE Weekly Issue #174

A special treat awaits you in the Outages section this week: five awesome incident followups!

Articles

What bugs cause cloud production incidents?

This is a study of every high-severity production incident at Microsoft Azure services over a span of six months in which the root cause was a software bug.

Adrian Colyer (summary)

Liu et al., HotOS’19 (original paper)

Listen to a Recorded Incident Response Call

PagerDuty created this re-enactment of an incident response phone bridge. It’s obviously fairly heavily redacted and paraphrased, but it’s still quite educational. It includes interludes where terms such as Incident Commander are explained.

George Miranda — PagerDuty

Outages

Google Calendar
Netflix
Hulu
Joyent May 27 2014 outage followup
- In this 2014 outage followup, we learn that a Joyent engineer accidentally rebooted an entire datacenter:
  
  The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the data center.
Salesforce May 17 outage followup
- Click through to read about the massive Salesforce outage last month. A database edit script contained a bug that ran an UPDATE without its WHERE clause, granting elevated permissions to more users than intended. Salesforce shut down broad chunks of their service to prevent data leakage.
Second Life mid-May outage followup
- Linden Lab posted about a network maintenance that went horribly wrong, resulting in a total outage.
  
  Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.
  
  April Linden — Linden Lab
Monzo May 30 outage followup
- Monzo posted this incredibly detailed followup for an outage from several weeks ago. Not only does it give us a lot of insight into their incident response process, I also got to learn about how UK bank transfers work. Thanks to an anonymous reader for this one.
  Nicholas Robinson-Wall — Monzo
Google Cloud Platform June 2 outage followup
- Along with the blog post I linked to last week, Google also posted this technical followup for their major June 2 outage. I’ve never seen one of their followups even close to this long or detailed, and that’s saying a lot.
