SRE Weekly Issue #77

I really love that some of you are taking vacations. Preventing burnout is really critical for improving reliability. That said, if you’d please exempt my address from your vacation auto-responder, that’d be super-cool ;)


New eBook for DevOps pros: The Dev and Ops Guide to Incident Management offers 25+ pages of essential insight into building teams and improving your response to downtime.


Last week, I linked to a Reddit story of an engineer who was unfairly fired for a mistake on their first day. Dr. Richard Cook picked this up and wrote up a great analysis of the underlying organizational issues.

Thanks to John Allspaw for this one.

This was released the week before last, but it took me a while to digest it. The ATO did a very thorough post-analysis on their two outages and released this polished report. I like that they took full responsibility for the outage even though it was an issue with a fully-managed vendor SAN offering, and they clearly sought to learn as much as possible.

Pinterest tech lead Suman Karumuri explains how they use distributed tracing and the benefits it’s brought them.

With these new use cases, we see tracing infrastructure as the third pillar of monitoring our services in addition to metrics and log search systems.

Frustrated by Willie Walsh’s public statement regarding British Airways’s major outage, Tripwire founder Gene Kim took it upon himself to write an open letter of apology as if he were an airline CEO. It’s pretty great.

This article explores several options for HA with Nginx: put an ELB in front of it, Route 53 with health checks, or an elastic IP switched either by keepalived or a Lambda function.
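
To make the elastic IP option a bit more concrete, here is a minimal sketch of the decision logic a keepalived-style watcher or a scheduled Lambda function might run. It is not taken from the linked article: the `FailoverMonitor` class, the `check` and `failover` callables, and the threshold of three consecutive failures are all assumptions made for illustration. In a real setup, `check` would probe the active Nginx node and `failover` would reassign the elastic IP to the standby.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Fail over only after several consecutive failed health checks."""

    check: Callable[[], bool]      # hypothetical probe of the active Nginx node
    failover: Callable[[], None]   # hypothetical action, e.g. moving the elastic IP
    threshold: int = 3             # assumed number of consecutive failures required
    failures: int = 0
    failed_over: bool = False

    def tick(self) -> bool:
        """Run one probe; return True if this tick triggered the failover."""
        if self.failed_over:
            # Already switched to the standby; nothing more to do.
            return False
        if self.check():
            # A healthy response resets the consecutive-failure counter.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.failover()
            self.failed_over = True
            return True
        return False
```

Requiring consecutive failures rather than acting on a single failed probe is one common way to avoid flapping the IP on a transient blip, at the cost of a slightly slower reaction to a real outage.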

I’ve been following GitLab’s blog since their engineer accidentally deleted their database earlier this year, and I’m glad I did. This article touches on all sorts of topics near to my heart: preventing burnout, examining incident response metrics, enforcing vacations, incident command, and having developers go on-call for what they wrote.

The costs associated with running a full-capacity redundant system in a secondary site can be numerous and subtle. Those costs can be especially hard to swallow when expected returns on infrastructure investments prove elusive.

Netflix explains in depth the careful scientific experiments they perform in production in order to improve the QoE (quality of experience).


  • Google Cloud Services
    • 62-minute multiple-zone total internet outage in asia-northeast1. Postmortem linked, including a description of several contributing factors.

      We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve.

  • Coinbase
  • YouTube
Updated: June 18, 2017 — 9:31 pm
A production of Tinker Tinker Tinker, LLC