SRE Weekly Issue #162

A message from our sponsor, VictorOps:

Ever been on-call? Then you know it can suck. Check out some of our tips and tricks to see how SRE teams are maintaining composure during a critical incident and making on-call suck less:


Want to nerd out on BGP? Check out how this person modeled the Eve Online universe as an 8000-VM cluster running BGP.

Ben Cartwright-Cox

Accrued vacation time is antiquated, and “unlimited” vacation paradoxically leads employees to take less time overall. Time to enforce vacations, lest we forget that burnout is a reliability risk.

Baron Schwartz

How to avoid catastrophe: pay attention to near misses. This article makes a compelling case that noticing near misses takes conscious effort, and explains how cognitive bias tends to make us do the exact opposite.

Catherine H. Tinsley, Robin L. Dillon, and Peter M. Madsen — Harvard Business Review

An intro to how blame causes problems, why blamelessness is better, and how to adopt a blameless culture.

Ashar Rizqi

A 100-year-old chemical company thought they had a great safety record. Turns out that folks were just considering accidents “routine” and not reporting them.

Thai Wood (reviewing a paper by Stefanie Huber, Ivette van Wijgerden, Arjan de Witt, and Sidney W.A. Dekker)

An organization with 50,000 servers and many SRE squads developed tools they call the Reliability Collaboration Model and the Ownership Map to help them define which products SRE squads support and at what level.

Emmanuel Goossaert —


  • New Relic
  • Duo Security
  • Amtrak (US long-distance passenger rail)
    • Amtrak had an outage of its switching system this past week. Linked above is an article with the inflammatory title, “Human error? Try abject stupidity, Amtrak”. Exercise: try to think of ways in which this is not a case of abject stupidity.

      Rich Miller — Capitol Fax

  • YouTube
Updated: March 3, 2019 — 8:23 pm