SRE Weekly Issue #174

A special treat awaits you in the Outages section this week: five awesome incident followups!

A message from our sponsor, VictorOps:

Creating on-call schedules for your SRE team(s) can be challenging. We’ve put together a short list of tips, tricks, and tools you can use to better organize your on-call rotations and help your SRE efforts:

http://try.victorops.com/SREWeekly/SRE-On-Call-Tips

Articles

This study covers every high-severity production incident in Microsoft Azure services over a six-month span whose root cause was a software bug.

Adrian Colyer (summary)

Liu et al., HotOS’19 (original paper)

PagerDuty created this re-enactment of an incident response phone bridge. It’s obviously fairly heavily redacted and paraphrased, but it’s still quite educational. It includes interludes where terms such as Incident Commander are explained.

George Miranda — PagerDuty

Outages

  • Google Calendar
  • Netflix
  • Hulu
  • Joyent May 27 2014 outage followup
    • In this 2014 outage followup, we learn that a Joyent engineer accidentally rebooted an entire datacenter:

      The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the data center.

  • Salesforce May 17 outage followup
    • Click through to read about the massive Salesforce outage last month. A database edit script contained a bug that ran an UPDATE without its WHERE clause, granting elevated permissions to more users than intended. Salesforce shut down broad chunks of their service to prevent data leakage.
  • Second Life mid-May outage followup
    • Linden Lab posted about a network maintenance that went horribly wrong, resulting in a total outage.

      Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.

      April Linden — Linden Lab

  • Monzo May 30 outage followup
    • Monzo posted this incredibly detailed followup for an outage from several weeks ago. Not only does it give us a lot of insight into their incident response process, I also got to learn about how UK bank transfers work. Thanks to an anonymous reader for this one.

      Nicholas Robinson-Wall — Monzo

  • Google Cloud Platform June 2 outage followup
    • Along with the blog post I linked to last week, Google also posted this technical followup for their major June 2 outage. I’ve never seen one of their followups even close to this long or detailed, and that’s saying a lot.
Updated: June 23, 2019 — 10:22 pm
A production of Tinker Tinker Tinker, LLC