This article covers six methods for mitigating thundering herd problems, with a pretty diagram for each.
Sid
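For a concrete feel of what mitigating a thundering herd can look like, here is a minimal sketch of one widely used approach: exponential backoff with full jitter. This is my own illustration and not necessarily one of the six methods in the article; the function name `backoff_delays` and its parameters (`base`, `cap`, `rng`) are hypothetical.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Hypothetical helper for illustration. Each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], so clients that failed at the
    same moment spread their retries out instead of retrying in lockstep.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Two clients that fail together will usually pick different delays.
    print(backoff_delays(5))
    print(backoff_delays(5))
```

The randomness is the point: without it, every client waits the same amount of time and the herd simply arrives again, together, a little later.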
Some thoughts on the “second victim” concept. As a note, I was one of the participants in the discussion on which this article is based.
Fractal Flame
Written in response to a question about the big CrowdStrike outage earlier this year, this article asks: do we need to start using safer languages?
Kode Vicious — ACM Queue
This one uses a cool technique I hadn’t seen before: they hardcoded a cutoff time into both the old and new systems, so the two cut over automatically at the same moment.
Md Riyadh, Jia Long Loh, Muqi Li, and Pu Li — Grab
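The hardcoded-cutoff idea can be sketched in a few lines. This is only an illustration under my own assumptions, not Grab's actual implementation: the constant `CUTOVER_AT`, its value, and the function names are all hypothetical, and the sketch assumes both systems have reasonably synchronized clocks.

```python
from datetime import datetime, timezone

# Hypothetical cutoff time, baked into BOTH the old and new systems at build time.
CUTOVER_AT = datetime(2030, 1, 1, 0, 0, 0, tzinfo=timezone.utc)


def active_system(now=None):
    """Return which system should handle traffic at the instant `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return "new" if now >= CUTOVER_AT else "old"


def old_system_should_serve(now=None):
    # The old system runs this check and stops serving once the cutoff passes.
    return active_system(now) == "old"


def new_system_should_serve(now=None):
    # The new system runs the same check and starts serving at the cutoff.
    return active_system(now) == "new"


if __name__ == "__main__":
    print(active_system())
```

Because both sides evaluate the same constant, no coordination message is needed at cutover time; the trade-off is that any clock skew between the two systems turns into a brief window where both or neither is serving.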
Here’s a great writeup of a problem with the UK flight system involving a latent bug. Among several cool takeaways, I really liked the way the official incident report didn’t try to pretend this weird bug could have been foreseen and prevented.
Chris Evans — incident.io
This game day ended up far more eventful than intended: it exposed a serious Kubernetes configuration flaw and caused a real outage. Oops!
Lawrence Jones
It’s all fun and games until someone accidentally uses too much DTAZ (data transfer between availability zones). Good monitoring story, too!
Grzegorz Skołyszewski — Prezi
OpenAI posted this writeup of an incident earlier this week. They tried to deploy detailed monitoring for their Kubernetes cluster, but the monitoring system overloaded the Kubernetes API.
OpenAI
And here’s Lorin Hochstein’s analysis of OpenAI’s incident writeup, including a recurring theme:
This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability.
Lorin Hochstein