SRE Weekly Issue #222


Articles

Meaningful availability: How many nines do you actually need?

This article in a nutshell:

Nines don’t matter if users aren’t happy (h/t Charity Majors)
Chaos engineering

Kolton Andrus — Gremlin

Byzantine and non-Byzantine distributed systems

I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.

Ayende Rahien — RavenDB

Using SRE to meet reliability challenges

In our experience, the three big sources of production stress are:

Toil

Bad monitoring

Immature incident handling procedures

Cheryl Kang — Google

Faulty Equipment, Lapsed Training, Repeated Warnings: How a Preventable Disaster Killed Six Marines

ProPublica picks apart the incident in exhaustive detail, showing how multiple interwoven problems within the organization contributed to this tragedy.

Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica

SRE, CSE, and the safety boundary

There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system’s operating point moves within a space bounded by three boundaries:

the boundary to economic failure
the boundary of unacceptable work load
the boundary of functionally acceptable performance

Lorin Hochstein

The Tail at Scale Revisited

This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.

Bill Duncan
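
To make the relationship between N and R concrete, here is a minimal sketch under simplifying assumptions that may differ from the article's own model: every request needs all N backends to succeed, failures are independent, and all backends share the same reliability r. Under those assumptions R = r^N, so r = R^(1/N). The function name is hypothetical and chosen only for illustration.

```python
def required_backend_reliability(target: float, n_backends: int) -> float:
    """Per-backend reliability r needed so that r ** n_backends == target.

    Assumes (for illustration only) that a request succeeds only if every
    backend succeeds, and that backend failures are independent.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    if n_backends < 1:
        raise ValueError("n_backends must be at least 1")
    return target ** (1.0 / n_backends)


if __name__ == "__main__":
    # Show how the per-backend requirement tightens as the fan-out grows.
    for n in (1, 10, 100):
        r = required_backend_reliability(0.999, n)
        print(f"N={n:>3}: each backend needs reliability of about {r:.6f}")
```

The takeaway from this simplified model is that the per-backend requirement climbs quickly with N: each added dependency eats into the same overall error budget.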

Oncall and COVID-19 Survey Results

Here are the results of the survey I linked to a couple of weeks ago. Some of the findings are interesting and surprising, and well worth a read.

Rich Burroughs — FireHydrant

The mystery of the expiring Sectigo web certificate

A commonly used CA’s root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.

Paul Ducklin — Naked Security

Outages

PagerDuty
Coinbase
- Coinbase had an outage on June 1. Click for their post-incident analysis.
Robinhood
- Robinhood’s status page doesn’t show history, so I can’t verify this one.
iCloud
eBay
- eBay’s status page also doesn’t show history, so I can’t verify this one either.
Lloyds and Halifax (bank)
Adobe Cloud
Squarespace
- Their followup post discusses the large-scale DDoS that contributed to the outage.
HostedGraphite
Telegram
