SRE Weekly Issue #486

Slight Reliability Podcast Episode 100: Learning with John Allspaw

For his hundredth(!) episode of Slight Reliability, Stephen Townshend has an awesome chat with John Allspaw. I especially loved the part where John pointed out that different people will get different “Aha Moments” from the same incident.

Stephen Townshend

RTO vs RPO: Key Differences for Modern Disaster Recovery

This article delves into the nuances of Recovery Time Objective and Recovery Point Objective and how to manage both without spending too much. There’s a strong theme of using feature flags, as you might expect from this company, but the article goes beyond a one-dimensional product pitch.

Jesse Sumrak — LaunchDarkly

ChatOps fatigue: how to create alerts that matter

A discussion of the qualities of a good alert and how to audit and improve your alerting.

Hannah Roy — Tines

Component defects: RCA vs RE

This one contrasts two views on latent defects in our systems, from Root Cause Analysis and Resilience Engineering perspectives. The RE perspective looks scary, but it’s much more nuanced than that.

Lorin Hochstein

When Caches Collide: Solving Race Conditions in Fare Updates

Grab has seen multiple scenarios in which concurrent cache writes result in inconsistent fares. This article explains their strategies for detecting and dealing with them.

Ravi Teja Thutari — DZone

Budibase Cloud January 9th Incident

Adding a node to a CouchDB cluster went poorly, resulting in lost data in this incident from 2024.

The mistake we made in our automated process for adding nodes was to add the new node to our load balancer before it had fully synchronised.

Sam Rose — Budibase

We had issues with Monzo on 29th July. Here’s what happened, and what we did to fix it.

The parallels between this incident and the Budibase one above are striking! I swear it’s a coincidence that I came across both of these old incident reports in the same week.

Chris Evans and Suhail Patel — Monzo

Cloudflare 1.1.1.1 Incident on July 14, 2025

Another tricky failure mode for Cloudflare’s massive DNS resolver service. They share all the details in this post with their usual flare (sorry, I couldn’t resist).

Ash Pallarito and Joe Abley — Cloudflare


