SRE Weekly Issue #486

A message from our sponsor, Spacelift:

IaC Experts! IaCConf Call for Presenters – August 27, 2025

The upcoming IaCConf Spotlight dives into the security and governance challenges of managing infrastructure as code at scale. From embedding security in your pipelines to navigating the realities of open source risk, this event brings together practitioners who are taking a security-minded approach to how they implement IaC in their organization.

Call for Presenters is now open until Friday, August 1. Submit your CFP or register for the free event today.

https://events.iacconf.com/iac-security-spotlight-august-2025/?utm_medium=email&utm_source=sreweekly

For his hundredth(!) episode of Slight Reliability, Stephen Townshend has an awesome chat with John Allspaw. I especially loved the part where John pointed out that different people will get different “Aha Moments” from the same incident.

  Stephen Townshend

This article delves deep into the nuances of Recovery Time Objective and Recovery Point Objective and how to manage both without spending too much. There’s a strong theme of using feature flags, as you might expect from this company, but the article goes beyond being a one-dimensional product pitch.

  Jesse Sumrak — LaunchDarkly

A discussion of the qualities of a good alert and how to audit and improve your alerting.

  Hannah Roy — Tines

This one contrasts two views on latent defects in our systems, from Root Cause Analysis and Resilience Engineering perspectives. The RE perspective looks scary, but it’s much more nuanced than that.

  Lorin Hochstein

Grab has seen multiple scenarios in which concurrent cache writes result in inconsistent fares. This article explains their strategies for detecting and dealing with them.

  Ravi Teja Thutari — DZone

Adding a node to a CouchDB cluster went poorly, resulting in lost data in this incident from 2024.

The mistake we made in our automated process for adding nodes was to add the new node to our load balancer before it had fully synchronised.

  Sam Rose — Budibase
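
One general way to guard against this class of mistake is to gate load balancer registration on a sync check. Here is a minimal Python sketch of that idea. It is not Budibase's actual fix, and `is_synced` and `register_with_load_balancer` are hypothetical callables you would supply for your own database and load balancer.

```python
import time
from typing import Callable


def add_node_when_synced(
    is_synced: Callable[[], bool],                      # hypothetical: True once the node has caught up
    register_with_load_balancer: Callable[[], None],    # hypothetical: starts sending traffic to the node
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Register a node for traffic only after it reports itself fully synchronised.

    Returns True if the node was registered, False if the timeout expired first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_synced():
            # Only now is it safe to let clients read from or write to this node.
            register_with_load_balancer()
            return True
        if clock() >= deadline:
            # Give up without registering; the caller decides whether to alert or retry.
            return False
        sleep(poll_interval_s)
```

The important property is the ordering: the node never receives traffic unless the sync check has passed, and a timeout leaves it out of rotation instead of adding it anyway.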

The parallels between this incident and the Budibase one above are striking! I swear it’s a coincidence that I came across both of these old incident reports in the same week.

  Chris Evans and Suhail Patel — Monzo

Another tricky failure mode for Cloudflare’s massive DNS resolver service. They share all the details in this post with their usual flare (sorry, I couldn’t resist).

  Ash Pallarito and Joe Abley — Cloudflare

Updated: July 20, 2025 — 9:15 pm
A production of Tinker Tinker Tinker, LLC