SRE Weekly Issue #81

SPONSOR MESSAGE

The definitive guide for DevOps Post-Incident Reviews (AKA – Postmortems). Learn why traditional methods don’t work – and why fast incident response isn’t enough. Download your free copy of the 90+ page eBook from O’Reilly Media and VictorOps.
http://try.victorops.com/post_incident_review/SREWeekly

Articles

PagerDuty shared this timeline of their progress in adopting Chaos Engineering through their Failure Friday program. This is brilliant:

We realized that Failure Fridays were a great opportunity to exercise our Incident Response process, so we started using it as a training ground for our newest Incident Commanders before they graduated.

I’m a big proponent of having developers own their code in production. This article posits that SRE’s job is to provide a platform that enables developers to do that more easily. I like the idea that containers and serverless are ways of getting developers closer to operations.

These platforms and the CI/CD pipelines they enable make it easier than ever for teams to own their code from desktop to production.

This reads less like an interview and more like a description of Amazon’s incident response procedure. I started paying close attention at step 3, “Learn from it”:

Vogels places the blame not on the engineer directly responsible, but Amazon itself, for not having failsafes that could have protected its systems or prevented the incorrect input.

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 1 in a 3-part series on system visibility, the detection part of incident management.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This article is about a different kind of human factor than the ones I often link to: cognitive bias. The author makes the case that SREs work to limit the effects of cognitive bias in operational decision-making.

Outages

  • OVH
    • OVH suffered a major outage in a datacenter, taking down 50,000 websites that they host. The outage was caused by a leak in their custom water-cooling system and resulted in a painfully long 24-hour recovery from an offsite backup. The Register’s report (linked) is based on OVH’s incident log and is the most interesting datacenter outage description I’ve read this year.
  • Google Cloud Storage
    • Google posted this followup for an outage that occurred on July 6th. As usual, it’s an excellent read filled with lots of juicy details. This caught my eye:

      […] attempts to mitigate the problem caused the error rate to increase to 97%.

      Apparently this was caused by a “configuration issue” and was quickly reverted. It’s notable that they didn’t include anything about this error in the remediations section.

  • Melbourne, AU’s Metro rail network
    • A network outage stranded travelers, and switching to the DR site “wasn’t an option”.
  • Somalia