SRE Weekly Issue #59

Much like I did with telecoms, I’ve decided that it’s time to stop posting every MMO game outage that I see go by.  They rarely share useful postmortems, and they’re frequently the target of DDoS attacks.  If I see an intriguing one go by, though, I’ll be sure to include it.

SPONSOR MESSAGE

Interested in ChatOps? Get the free 75-page O’Reilly report covering everything from basic concepts to deployment strategies. http://try.victorops.com/sreweekly/chatops

Articles

Here’s a great article about burnout in the healthcare sector. There’s mention of second victims (see also Sidney Dekker) and a vicious circle: burnout leads to mistakes, which lead to adverse patient outcomes, which lead to guilt and frustration, which lead back to burnout.

Every week, I find and ignore at least one bland article about the “huge cost of downtime”. Such articles almost never have anything interesting or new to say. This article by PagerDuty takes a different approach that I find refreshing, starting off by defining “downtime” itself.

A frustrated CEO speaks out against AWS’s infamously sanguine approach to posting on their status site.

As mentioned last week, here’s the final, published version of GitLab’s postmortem for their incident at the end of last month.

An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact.

MongoDB contracted Jepsen to test their new replication protocol. Jepsen found some issues, which have since been fixed, and now MongoDB gets a clean bill of health. Pretty impressive! Even cooler is that the Mongo folks have integrated Jepsen’s tests into their CI.

Outages

  • Instapaper
    • Instapaper hit a performance cliff with their database, and the only path forward was to dump all data and load it into a new, more powerful DB instance.
  • Google Cloud Status Dashboard
    • Google released a postmortem for a network outage at the end of January.
  • OWASA (Orange Water and Sewer Authority; Orange County, NC, USA)
    • OWASA had to cut off the municipal water supply for 3 days after an accidental overfeed of fluoride into the drinking water. They performed an impressive post-incident analysis and released a detailed root cause analysis document. It’s a pretty interesting read, and I highly recommend clicking through to the PDF. There you’ll see that “human error” was a proximal but by no means root cause of the outage, especially since the human in question corrected their error after just 12 seconds.
Updated: February 12, 2017 — 8:17 pm
A production of Tinker Tinker Tinker, LLC