SRE Weekly Issue #222

A message from our sponsor, StackHawk:

The last thing we need is more noise from more tooling. With the new Findings Management feature, you can add AppSec tests to your CI pipeline without being inundated with alerts.
https://www.stackhawk.com/blog/appsec-findings-management?utm_source=SREWeekly

Articles

This article in a nutshell:

Kolton Andrus — Gremlin

I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.

Ayende Rahien — RavenDB

In our experience, the three big sources of production stress are:

  • Toil
  • Bad monitoring
  • Immature incident handling procedures

Cheryl Kang — Google

ProPublica picks apart the incident in exhaustive detail, showing how multiple problems interwoven in the organization contributed to this tragedy.

Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica

There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system moves between three boundaries:

  • the boundary to economic failure
  • the boundary of unacceptable work load
  • the boundary of functionally acceptable performance

Lorin Hochstein

This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.

Bill Duncan
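As a rough illustration of the relationship that graph describes, here is a minimal sketch. It assumes a request succeeds only if all N backends succeed and that their failures are independent, which may not match the model used in the linked article. Under those assumptions the composite reliability is r^N, so each service needs r = R^(1/N). The function names are hypothetical.

```python
def required_per_service_reliability(target: float, n_services: int) -> float:
    """Per-service reliability needed so that n_services in series hit `target`.

    Assumes every service must succeed and failures are independent,
    so composite reliability = r ** n_services.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if n_services < 1:
        raise ValueError("n_services must be at least 1")
    return target ** (1.0 / n_services)


def composite_reliability(per_service: float, n_services: int) -> float:
    """Reliability of a request that depends on all n_services succeeding."""
    return per_service ** n_services


if __name__ == "__main__":
    # Show how the per-service requirement tightens as N grows for R = 99.9%.
    for n in (1, 5, 10, 50):
        r = required_per_service_reliability(0.999, n)
        print(f"N={n:>2}: each service needs about {r:.6%}")
```

With a 99.9% target and ten backends, each backend needs roughly 99.99% under these assumptions, which is why the per-service bar climbs so quickly as N grows.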

Here are the results of the survey I linked here a couple weeks ago. There are some interesting and surprising results, well worth a read.

Rich Burroughs — FireHydrant

A commonly used CA’s root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.

Paul Ducklin — Naked Security

Outages

Updated: June 7, 2020 — 9:15 pm
A production of Tinker Tinker Tinker, LLC