For his hundredth(!) episode of Slight Reliability, Stephen Townshend has an awesome chat with John Allspaw. I especially loved the part where John pointed out that different people will get different “Aha Moments” from the same incident.
Stephen Townshend
This article delves deep into the nuances of Recovery Time Objective and Recovery Point Objective and how to manage both without spending too much. There’s a strong theme of using feature flags, as you might expect from this company, but this article goes beyond being just a one-dimensional product pitch.
Jesse Sumrak — LaunchDarkly
A discussion of the qualities of a good alert and how to audit and improve your alerting.
Hannah Roy — Tines
This one contrasts two views on latent defects in our systems, from Root Cause Analysis and Resilience Engineering perspectives. The RE perspective looks scary, but it’s much more nuanced than that.
Lorin Hochstein
Grab has seen multiple scenarios in which concurrent cache writes result in inconsistent fares. This article explains their strategies for detecting and dealing with them.
Ravi Teja Thutari — DZone
Adding a node to a CouchDB cluster went poorly, resulting in lost data in this incident from 2024.
“The mistake we made in our automated process for adding nodes was to add the new node to our load balancer before it had fully synchronised.”
Sam Rose — Budibase
The parallels between this incident and the Budibase one above are striking! I swear it’s a coincidence that I came across both of these old incident reports in the same week.
Chris Evans and Suhail Patel — Monzo
Another tricky failure mode for Cloudflare’s massive DNS resolver service. They share all the details in this post with their usual flare (sorry, I couldn’t resist).
Ash Pallarito and Joe Abley — Cloudflare