SRE Weekly Issue #143


Minimum viable runbooks are a way to spend less time building runbooks and more time using them. Learn more about creating actionable runbooks to support SRE and make on-call suck less:


There’s some great statistics theory in here. The challenge is: how can you have accurate, useful A/B tests without having to wait a long time to get a big enough sample size? Can you bail out early if you know the test has already failed? Can you refine the new feature mid-test?

Callie McRee and Kelly Shen — Etsy

Don’t just rename your Ops team to “SRE” and expect anything different, says this author.

Ernest Mueller — The Agile Admin

Great idea:

So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.

Yan Cui
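To make the quoted idea concrete, here is a minimal Python sketch of tracking the share of slow requests over a sliding time window and alerting when it exceeds 1%. The class name `SlowRequestMonitor`, the 500 ms threshold, and the 60-second window are illustrative assumptions, not details from the linked article.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlowRequestMonitor:
    """Hypothetical helper: share of requests slower than a threshold in a sliding window."""

    threshold_ms: float            # latency above which a request counts as slow (assumed)
    window_s: float                # sliding window length in seconds (assumed)
    max_slow_ratio: float = 0.01   # alert above 1%, as in the quoted passage
    _samples: deque = field(init=False, default_factory=deque)  # (timestamp_s, is_slow)
    _slow: int = field(init=False, default=0)

    def record(self, timestamp_s: float, latency_ms: float) -> None:
        """Add one request, then drop samples that fell out of the window."""
        is_slow = latency_ms > self.threshold_ms
        self._samples.append((timestamp_s, is_slow))
        self._slow += int(is_slow)
        cutoff = timestamp_s - self.window_s
        while self._samples and self._samples[0][0] <= cutoff:
            _, was_slow = self._samples.popleft()
            self._slow -= int(was_slow)

    def slow_ratio(self) -> float:
        """Fraction of requests in the current window that were over the threshold."""
        return self._slow / len(self._samples) if self._samples else 0.0

    def should_alert(self) -> bool:
        """True when the slow fraction is strictly greater than the allowed ratio."""
        return self.slow_ratio() > self.max_slow_ratio


monitor = SlowRequestMonitor(threshold_ms=500, window_s=60)
for i in range(99):
    monitor.record(i * 0.1, latency_ms=100)   # 99 fast requests
monitor.record(10.0, latency_ms=900)          # 1 slow out of 100: exactly 1%
at_one_percent = monitor.should_alert()
monitor.record(10.1, latency_ms=900)          # 2 slow out of 101: above 1%
above_one_percent = monitor.should_alert()
```

Compared with alerting on a percentile value, a ratio like this stays meaningful at low traffic only if the window holds enough requests, so a real alarm would probably also want a minimum sample count.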

There’s a ton of detail here, and it’s a great read. Lots of juicy tidbits about PoP selection, load balancing, and performance monitoring.

Oleg Guba and Alexey Ivanov — Dropbox

Full disclosure: Fastly, my employer, is mentioned.

Even as a preliminary report, there’s a lot to digest here about what caused the series of gas explosions last month in Massachusetts (US). I feel like I’ve been involved in incidents with similar contributing factors.

US National Transportation Safety Board (NTSB)

This isn’t just a recap of a bad day, although the outage description is worth reading by itself. Readers also gain insight into the evolution of this engineer’s career and mindset, from entry-level to Senior SRE.

Katie Shannon — LinkedIn

GitLab, in their trademark radically open style, goes into detail on the reasons behind the recent increase in the reliability of their service.

Andrew Newdigate — GitLab

Five nines are key when you consider that Twilio’s service uptime can literally mean life and death. Click through to find out why.

Charlie Taylor — Blameless


  • Travis CI
  • Google Compute Engine us-central1-c
    • I can’t really summarize this incident report well, but I highly recommend reading it.
  • Azure
    • Duplicated here since I can’t deep-link:

      Summary of impact: Between 01:22 and 05:50 UTC on 13 Oct 2018, a subset of customers using Storage in East US may have experienced intermittent difficulties connecting to resources hosted in this region. Other services leveraging Storage in the region may have also experienced impact related to this incident.

  • Instagram
  • Heroku
    • This one’s notable for the duration: about 10 days of diminished routing performance due to a bad instance.
  • Microsoft Outlook
Updated: October 14, 2018 — 8:24 pm