SRE Weekly Issue #154

A message from our sponsor, VictorOps:

The golden signals of SRE will help you create visibility into system health and allow you to proactively build robust services. See how you can start leveraging SRE’s golden signals today:


Hands-down the best thing I’ve read in a while! The author draws on the work of Nancy Leveson, applying her STAMP theory to a recent incident involving a rogue NPM package that stole bitcoin wallets.

Hillel Wayne

For more on STAMP theory (Systems-Theoretic Accident Modeling and Processes), check out this academic paper by Leveson et al. It centers around a chilling case study of the E. coli poisoning of a community in Canada. While it starts off looking like a clear case of negligence, it quickly becomes apparent that an accident of this sort was nearly guaranteed to happen.

Nancy Leveson, Mirna Daouk, Nicolas Dulac, and Karen Marais

It’s pretty much as awesome as you’d expect given that title. I originally thought this was a video or audio AMA and was waiting for a recording to be posted. Instead, he answered the excellent questions in the comments, and each answer is like its own polished article.

John Allspaw (and many commenters)

My fundamental issue with being on call is that I care more about my personal life & health than I do about whether my employer’s website is operational.

I assume everyone does! So…why do we put up with on-call at all?

Required reading for anyone who’s on call or manages folks that are on call.

Sarah Mei

If you manage an SRE team or intend to start one, this article will help you understand the types of documents your team needs to write and why each type is needed, allowing you to plan for and prioritize documentation work along with other team projects.

Shylaja Nukala and Vivek Rau — ACM Queue


  • Amazon Alexa
  • Discord
  • Google Cloud
    • Followup post for an incident that occurred on December 21:

      The additional load was created by a partially-deployed new feature. A routine maintenance operation in combination with this new feature resulted in an unexpected increase in the load on the metadata store.

  • 911 emergency service in communities across the US
    • While visiting the library with my kids, my phone (and those of others around me) blew up with an emergency alert telling me that 911 service was down and that I should dial the local police directly. CenturyLink provides the infrastructure that runs emergency phone services for various areas in the US, and they had an extended outage.
  • Snapchat
  • Banner Health electronic health records
Updated: January 6, 2019 — 8:26 pm