This article in a nutshell:
- Nines don’t matter if users aren’t happy (h/t Charity Majors)
- Chaos engineering
Kolton Andrus — Gremlin
I hadn’t heard of this distinction before. If you haven’t either, click through to find out more.
Ayende Rahien — RavenDB
In our experience, the three big sources of production stress are:
- Bad monitoring
- Immature incident handling procedures
Cheryl Kang — Google
ProPublica picks apart the incident in exhaustive detail, showing how multiple interwoven organizational problems contributed to this tragedy.
Robert Faturechi, Megan Rose and T. Christian Miller — ProPublica
There’s a great review of Rasmussen’s safety boundary model, which I wasn’t previously familiar with. A system’s operating point moves within a space bounded by three boundaries:
- the boundary to economic failure
- the boundary of unacceptable work load
- the boundary of functionally acceptable performance
This one includes a really nifty graph showing how reliable your N backend microservices need to be in order to hit a given reliability target R.
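As a rough illustration of that relationship, here is a minimal sketch. It assumes the simplest possible model, which may not be the one the linked article uses: every request has to pass through all N services, failures are independent, and every service has the same reliability r, so that R = r^N and r = R^(1/N). The function names are hypothetical.

```python
def required_service_reliability(target: float, n_services: int) -> float:
    """Per-service reliability r needed so that r ** n_services == target.

    Assumes (for illustration only) that a request touches every service
    in series and that failures are independent.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    if n_services < 1:
        raise ValueError("n_services must be at least 1")
    return target ** (1.0 / n_services)


def combined_reliability(per_service: float, n_services: int) -> float:
    """Overall reliability of n_services identical services in series."""
    return per_service ** n_services


if __name__ == "__main__":
    target = 0.999  # overall reliability target R
    for n in (1, 5, 10, 50):
        r = required_service_reliability(target, n)
        print(f"N={n:>3}: each service needs r = {r:.6f}")
```

Under these assumptions, hitting three nines overall with 10 backends means each backend needs roughly four nines, and the requirement keeps tightening as N grows.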
Here are the results of the survey I linked here a couple of weeks ago. Some of the findings are surprising, and the whole thing is well worth a read.
Rich Burroughs — FireHydrant
A commonly used CA’s root certificate expired, causing some havoc. Even though Sectigo did everything right, some software didn’t handle the transition to the new root well.
Paul Ducklin — Naked Security
- Coinbase had an outage on June 1. Click for their post-incident analysis.
- Robinhood’s status page doesn’t show history, so I can’t verify this one.
- eBay’s status page also doesn’t show history, so I can’t verify this one either.
- Lloyds and Halifax (bank)
- Adobe Cloud
- Their followup post discusses the large-scale DDoS that contributed to the outage.