SRE Weekly Issue #16

Another packed issue this week, thanks in no small part to the folks on hangops #incident_response. You all rock!

This week, I broke 200 email subscribers. Thank you all so much! At Charity Majors’s suggestion, I’ve started a Twitter account, SREWeekly, where I’ll post a link to each week’s issue as it comes out. Feel free to unsubscribe via email and follow there instead, if you’d prefer.

Articles

I love this article! Everything it says can be readily applied to SRE. It touches on blameless culture, the causes of errors, and methods of capturing incident information. Then there’s this excellent tidbit about analyzing all incidents, even near misses:

The majority of organizations target their most serious incidents for immediate attention. Events that lead to severe and/or permanent injury or death are typically underscored in an effort to prevent them from ever happening again. But recurrent errors that have the potential to do harm must also be prioritized for attention and process improvement. After all, whether an incident ultimately results in a near miss or an event of harm leading to a patient’s death is frequently a matter of a provider’s thoughtful vigilance, the resilience of the human body in resisting catastrophic consequences from the event, or sheer luck.

A short postmortem by PagerDuty for an incident earlier this month. I like how precise their impact figures are.

Thanks to cheeseprocedure on hangops #incident_response for this one.

Look past the contentious title, and you’ll see that this one’s got some really good guidelines for running an effective postmortem. To be honest, I think they’re saying essentially the same thing as the “blameless postmortem” folks. You can’t really be effective at finding a root cause without mentioning operator errors along the way; it’s just a matter of how they’re discussed.

Ultimately, the secret of those mythical DevOps blameless cultures that hold the actionable postmortems we all crave is that they actively foster an environment that accepts the realities of the human brain and creates a space to acknowledge blame in a healthy way. Then they actively work to look beyond it.

Thanks to tobert on hangops #incident_response for this one.

Ithaca College has suffered a series of days-long network outages, crippling everything from coursework to radio broadcasts. Their newspaper’s editorial staff spoke out this week on the cause and impact of the outages.

iTWire interviews Matthew Kates, Australia Country Manager for Zerto, a DR software firm, about the troubles Telstra has been dealing with. Kates does an admirable job of avoiding plugging his company, instead offering an excellent analysis of Telstra’s situation. He also gives us this gem, made achingly clear this week by Gliffy’s troubles:

Backing up your data once a day is no longer enough in this 24/7 ‘always on’ economy. Continuous data replication is needed, capturing every change, every second.

I love the part about an “architecture review” before choosing to implement a design for a new component (e.g. Kafka) and an “operability review” before deployment to ensure that monitoring, runbooks, etc. are all in place.

Atlassian posted an excellent, high-detail postmortem on last week’s instability. One of the main causes was an overloaded NAT service in a VPC, compounded by aggressive retries from their client.
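The compounding effect of retries is worth dwelling on. As a rough sketch (my own illustration of the general pattern, not Atlassian’s actual fix), a client that caps its attempts and backs off with random jitter puts far less pressure on a dependency that’s already struggling. The call_with_backoff name and parameters below are hypothetical:

    import random
    import time

    def call_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Retry a flaky call with capped exponential backoff plus full jitter,
        so a degraded dependency isn't hammered by synchronized retries."""
        for attempt in range(max_attempts):
            try:
                return request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up instead of retrying forever
                # Sleep a random amount up to an exponentially growing cap.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))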

Technically speaking, I’m not sure the NPM drama this week caused any actual production outages, but I feel like I’d be remiss in not mentioning it. Suffice it to say that we can never ignore human factors.

It’s interesting to read the workshop agenda and see how they handle human error in drug manufacturing.

Pusher shares a detailed description of their postmortem incident analysis process. I like that they front-load a lot of the information gathering and research process before the in-person review. They also use a tool to ensure that their postmortem reports have a consistent format.

Outages

  • Telstra
    • This makes the third major outage (plus a minor one) this year. Customers are getting pretty mad.

  • Gliffy
    • Gliffy suffered a heartbreaking 48-hour outage after an administrator mistakenly deleted the production database. They do have backups, but restoring them takes a long time.

      Thanks to gabinante on hangops #incident_response for this one.

  • The Division (game)
  • DigitalOcean
    • A day after the incident, DigitalOcean posted an excellent postmortem. I like that they clearly explained the technical details behind the incident. While they mentioned the DDoS attack, they didn’t use it to try to avoid taking responsibility for the downtime. Shortly after this was posted, it spurred a great conversation on hangops #incident_response that included the post’s author.

      Thanks to rhoml on hangops #incident_response for this one.
