SRE Weekly Issue #67


Are your incident management skills sharp, or are you continuously fighting fires? Take the free, online incident management assessment from VictorOps and compare your practices against leading DevOps methodologies.


This article is about the risks of automation. While automation can reduce risk by making errors less likely, it also disengages human operators from what’s actually happening, meaning that they’re less likely to catch and correct problems.

The author spent seven months sifting through, categorizing, and documenting over 1700 production incidents. The result was impressive: a massive improvement in the SRE team’s incident response process and documentation. It’s got me wondering if we can do something similar at $JOB.

Thanks to Steven Farlie for this one.

A danger of a microservice architecture is that one failing service can affect those that depend on it, even indirectly. The Netflix API handles over 10000 requests per second, and it was carefully designed to avoid the case where a slow dependency breaks unrelated requests.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (99.99%^30 ≈ 99.7% uptime = 2+ hours in a month).
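To make the arithmetic in that quote concrete, here is a minimal sketch. It assumes every request needs all 30 dependencies, that their failures are independent, and that a month is 30 days. The function names are mine, not Netflix's. Under those assumptions the combined uptime is the per-dependency uptime raised to the 30th power, which comes out to roughly 99.7%, or a little over two hours of downtime per month.

```python
def combined_availability(per_dependency: float, count: int) -> float:
    """Availability of a request that needs every one of `count` dependencies.

    Assumes failures are independent, so the availabilities multiply.
    """
    return per_dependency ** count


def downtime_hours(availability: float, hours_in_period: float = 30 * 24) -> float:
    """Expected hours of downtime over a period (default: a 30-day month)."""
    return (1.0 - availability) * hours_in_period


availability = combined_availability(0.9999, 30)
monthly_downtime = downtime_hours(availability)

print(f"combined uptime: {availability:.2%}")
print(f"expected downtime: {monthly_downtime:.2f} hours/month")
```

Changing the count or the per-dependency figure shows how quickly "four nines" per service erodes as the dependency list grows.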

Nuclear Family is an interactive play in which the audience is presented with critical decisions as the characters move inexorably toward a nuclear plant disaster. The goal is to demonstrate local rationality, the principle that people make the best decision they can with the information they have at hand, even if in retrospect that decision led to an adverse outcome.

Last year, PagerDuty moved toward giving developers operational responsibility for the systems they create. The really cool thing about their transition is that they have hard stats on reduction of incidents, decrease in MTTR, and increase in changes deployed to production.

This post is primarily a new feature announcement, but the intro section is just awesome. I love the idea of designing a system with empathy for your future self that will be on call for it.

A short but enlightening blog post on designing systems to degrade gracefully.

when weird stuff happens, make sure it doesn’t cause harm you didn’t expect or plan for.
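One common way to get this kind of graceful degradation is to put a time limit on each call to a dependency and return a cheaper default when the call is slow or fails. The sketch below is my own illustration, not code from the linked post. `call_with_fallback`, `slow_recommendations`, and the fallback value are hypothetical names chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small, bounded pool: a slow dependency can tie up at most these workers.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, fallback, timeout_s):
    """Run fn() in the pool; return `fallback` if it is too slow or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The dependency is slow: degrade instead of making the caller wait.
        return fallback
    except Exception:
        # The dependency failed outright: degrade rather than propagate.
        return fallback


def slow_recommendations():
    """Hypothetical dependency that takes too long to answer."""
    time.sleep(0.5)
    return ["personalized"]


result = call_with_fallback(slow_recommendations, ["popular"], timeout_s=0.05)
print(result)
```

Note that the timed-out call keeps running in its worker thread. The caller simply stops waiting for it, which is why the pool is kept small and bounded.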


  • Razer
    • Notably, this outage reset the careful customizations that people had made to their peripherals.

      Thanks to Steven Farlie for this one.

  • Heroku
    • Heroku had a 2-day long disruption that spanned 3 status site posts.

      Full disclosure: Heroku is my employer.

  • DigitalOcean
    • DigitalOcean accidentally deleted their primary database, resulting in a ~5-hour outage.

      A process performing automated testing was misconfigured using production credentials.

Updated: April 9, 2017 — 10:00 pm
A production of Tinker Tinker Tinker, LLC