SRE Weekly Issue #222

This article in a nutshell:

Kolton Andrus — Gremlin

I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.

Ayende Rahien — RavenDB

In our experience, the three big sources of production stress are:

  • Toil
  • Bad monitoring
  • Immature incident handling procedures

Cheryl Kang — Google

ProPublica picks apart the incident in exhaustive detail, showing how multiple problems interwoven in the organization contributed to this tragedy.

Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica

There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system moves between three boundaries:

  • the boundary to economic failure
  • the boundary of unacceptable work load
  • the boundary of functionally acceptable performance

Lorin Hochstein

This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.

Bill Duncan
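
The relationship behind a graph like that can be sketched in a few lines. This is a minimal illustration, not necessarily the model the article uses: it assumes every request depends on all N backends in series and that the backends fail independently, so the composite reliability is the per-service reliability raised to the Nth power. The function names are hypothetical.

```python
def composite_reliability(per_service: float, n_services: int) -> float:
    """Reliability of a request that needs all n_services backends to succeed.

    Assumption: backends fail independently and are all on the request path.
    """
    if not 0.0 <= per_service <= 1.0:
        raise ValueError("per_service must be between 0 and 1")
    if n_services < 1:
        raise ValueError("n_services must be at least 1")
    return per_service ** n_services


def required_backend_reliability(target: float, n_services: int) -> float:
    """Per-service reliability needed so that N serial backends meet target R.

    Inverts R = r ** N to get r = R ** (1 / N), under the same assumptions.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if n_services < 1:
        raise ValueError("n_services must be at least 1")
    return target ** (1.0 / n_services)


if __name__ == "__main__":
    target = 0.999  # example target R ("three nines")
    for n in (1, 5, 10, 50):
        r = required_backend_reliability(target, n)
        print(f"N={n:>2}: each backend needs about {r:.6%}")
```

Under these assumptions, hitting a 99.9% target with 10 serial backends requires each backend to be roughly 99.99% reliable, and the requirement tightens as N grows. Retries, caching, or optional dependencies would loosen it, and correlated failures would change the math entirely.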

Here are the results of the survey I linked here a couple weeks ago. There are some interesting and surprising results, well worth a read.

Rich Burroughs — FireHydrant

A commonly used CA’s root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.

Paul Ducklin — Naked Security


Updated: June 7, 2020 — 9:15 pm
