SRE Weekly Issue #66


Are your incident management skills sharp, or are you continuously fighting fires? Take the free, online incident management assessment from VictorOps and compare your practices against leading DevOps methodologies:


I hope you’ll enjoy reading this debug session as much as I enjoyed co-writing it. My former co-worker and I did some serious digging to get to the bottom of an unexpected EADDRINUSE that caused a production incident.

Full disclosure: Heroku is my employer.

Distributed filesystems provide high availability by duplicating data. In this research paper, the researchers created errorfs, a FUSE plugin that passes through a backing filesystem but introduces single-bit errors. Result: almost all major distributed filesystems missed the error, resulting in corruption.

The part I like most about this article is the emphasis on the difference between DR and HA.

Full disclosure: Heroku, my employer, is mentioned.

The S3 outage a month ago is a great reminder that chaos experiments are useful not just for taking down parts of our own infrastructure, but also simulating the failure of external dependencies.

There are several core HumanOps principles, but the most important one to remember is that human health impacts business health.

It’s about time that we recognised that engineers are humans who get stressed and need downtime and that there are strong business as well as social reasons why these needs should be met.

Impressively quickly, USENIX has posted the videos from SRECon17 Americas! I’ve linked to a post by Woodland Hunter, whose review of SRECon I featured here two weeks ago, with links to the talks he reviewed and more.

The first article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

PagerDuty theorizes that if developers don’t trust the incident response process, they’ll fear outages and thus be less productive. Proper incident management eases that fear so that they feel safer deploying code.

This article could be titled, “Use these three wacky tricks to reduce your pages by 100x!” In all seriousness, the methods described are aggregation (group related alerts), routing (sort alerts by team), and classification (page-worthy alerts versus warnings).

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

A nice primer on using tc to induce latency, which is really important when testing the resiliency of systems to network instability. Thanks, Julia!

Here’s the second half of Stephen Thorne’s commentary on “Embracing Risk”, the third chapter in Google’s SRE book.

As your company grows in infrastructure size, number of employees, load, and other areas, how do you change your incident response to cope?


  • Azure status history
    • While following up on an outage from a couple of weeks ago, I came upon this archive of Azure incidents, several with detailed postmortems. It’s a goldmine of interesting RCAs, but I wish they’d give each its own page for easy linking.
Updated: April 2, 2017 — 9:38 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme