
SRE Weekly Issue #486

A message from our sponsor, Spacelift:

IaC Experts! IaCConf Call for Presenters – August 27, 2025

The upcoming IaCConf Spotlight dives into the security and governance challenges of managing infrastructure as code at scale. From embedding security in your pipelines to navigating the realities of open source risk, this event brings together practitioners who are taking a security-minded approach to how they implement IaC in their organization.

Call for Presenters is now open until Friday, August 1. Submit your CFP or register for the free event today.

https://events.iacconf.com/iac-security-spotlight-august-2025/?utm_medium=email&utm_source=sreweekly

For his hundredth(!) episode of Slight Reliability, Stephen Townshend has an awesome chat with John Allspaw. I especially loved the part where John pointed out that different people will get different “Aha Moments” from the same incident.

  Stephen Townshend

This article delves deep into the nuances of Recovery Time Objective and Recovery Point Objective and how to manage both without spending too much. There’s a strong theme of using feature flags as you might expect from this company, but this article goes beyond being just a one-dimensional product pitch.

  Jesse Sumrak — LaunchDarkly

A discussion of the qualities of a good alert and how to audit and improve your alerting.

  Hannah Roy — Tines

This one contrasts two views on latent defects in our systems, from Root Cause Analysis and Resilience Engineering perspectives. The RE perspective looks scary, but it’s much more nuanced than that.

  Lorin Hochstein

Grab has seen multiple scenarios in which concurrent cache writes result in inconsistent fares. This article explains their strategies for detecting and dealing with them.

   Ravi Teja Thutari — DZone
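
A common guard against this class of bug is optimistic concurrency: only accept a cache write if the version you read is still the current one. A toy in-memory sketch of that idea (illustrative only, not Grab's implementation):

```python
import threading

class VersionedCache:
    """Toy in-memory cache that rejects writes based on stale reads.
    Real systems apply the same idea across processes, e.g. with a
    compare-and-set operation in the cache layer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def set_if_unchanged(self, key, expected_version, value):
        """Write only if nobody else wrote since we read expected_version."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # lost the race; caller should re-read and retry
            self._data[key] = (current_version + 1, value)
            return True

cache = VersionedCache()
version, _ = cache.get("fare:ride-123")
if not cache.set_if_unchanged("fare:ride-123", version, {"amount": 12.50}):
    pass  # re-read the latest fare and recompute instead of overwriting blindly
```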

Adding a node to a CouchDB cluster went poorly, resulting in lost data in this incident from 2024.

The mistake we made in our automated process for adding nodes was to add the new node to our load balancer before it had fully synchronised.

  Sam Rose — Budibase
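
The generic version of the fix is to gate load-balancer registration on the new node reporting that it has caught up. A rough sketch of that gate; the /sync-status endpoint and thresholds here are hypothetical, not CouchDB's or Budibase's actual API:

```python
import json
import time
import urllib.request

def wait_until_synchronised(node_url, timeout_s=1800, poll_s=15):
    """Poll a (hypothetical) sync-status endpoint until the new node reports
    it has fully caught up; only then should it join the load balancer."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{node_url}/sync-status", timeout=5) as resp:
                status = json.load(resp)
            if status.get("caught_up"):
                return True
        except OSError:
            pass  # node not reachable yet; keep polling
        time.sleep(poll_s)
    raise TimeoutError(f"{node_url} never reported full synchronisation")

# wait_until_synchronised("http://new-node:5984")
# ...and only after this returns, register the node with the load balancer.
```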

The parallels between this incident and the Budibase one above are striking! I swear it’s a coincidence that I came across both of these old incident reports in the same week.

  Chris Evans and Suhail Patel — Monzo

Another tricky failure mode for Cloudflare’s massive DNS resolver service. They share all the details in this post with their usual flare (sorry, I couldn’t resist).

  Ash Pallarito and Joe Abley — Cloudflare

SRE Weekly Issue #485

YOUR AD COULD BE HERE!

SRE Weekly has openings for new sponsorships. Reply or email lex at sreweekly.com for details.

How would you migrate several million databases, with minimal impact to your users?

Atlassian allocates one Postgres database per tenant customer, with a few thousand colocated on each RDS instance. This migration story was a riveting read!

  Pat Rubis — Atlassian

Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong.

  Lorin Hochstein

My favorite part of this article was the explanation of how they handle pent-up logs when a customer’s endpoint recovers, without overwhelming the endpoint.

  Gabriel Reid — Datadog
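
The underlying problem shape, draining a backlog without flooding a freshly recovered endpoint, is a classic rate-limiting exercise. A toy token-bucket drain for flavour (not Datadog's implementation):

```python
import time
from collections import deque

def drain_backlog(backlog: deque, send, rate_per_s: float, burst: int):
    """Replay queued log batches at a capped rate so a recovering
    endpoint isn't overwhelmed by the pent-up volume."""
    tokens = float(burst)
    last = time.monotonic()
    while backlog:
        now = time.monotonic()
        tokens = min(burst, tokens + (now - last) * rate_per_s)
        last = now
        if tokens >= 1:
            send(backlog.popleft())
            tokens -= 1
        else:
            time.sleep((1 - tokens) / rate_per_s)

# drain_backlog(deque(pending_batches), send=post_to_customer_endpoint,
#               rate_per_s=50, burst=100)
```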

How do you deal with fundamental surprise? This article introduces the concept of surprise², an incident you couldn’t see coming. Click through for some strategies to handle the inevitable occasional fundamentally surprising incident.

  Stuart Rimell — Uptime Labs

A team found themselves needing to switch to microservices, and they chronicled their approach and results. I really like the section on the surprises they encountered.

   Shushyam Malige Sharanappa — DZone

Dropbox shares what went into the rollout of their new fleet, including careful management of heat, vibration, and power.

  Eric Shobe and Jared Mednick — Dropbox

In this blog post, we’ll dive into the details of three mighty alerts that play their unique role in supporting our production infrastructure, and explore how they’ve helped us maintain the high level of performance and uptime that our community relies on.

…plus one bonus alert!

  Jeremy Udit — Hugging Face

Klaviyo adopted RDS’s blue/green deployment feature to make MySQL version upgrades much less painful. In this article they share their path to blue/green deployment and their results.

  Marc Dellavolpe — Klaviyo

SRE Weekly Issue #484

This is really neat! They’ve developed a new approach to search that uses 3-letter “trigrams” rather than tokenizing words, making it especially well-suited to code search. It converts regular expressions to trigram searches behind the scenes.

  Dmitry Gruzd — GitLab
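
To get a feel for why trigrams suit code search, here's the core indexing idea in a few lines of Python (a toy sketch, not GitLab's implementation):

```python
def trigrams(text: str) -> set[str]:
    """Every overlapping 3-character substring; no word tokenization,
    which is why identifiers like err_ctx or foo->bar index cleanly."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Index: trigram -> set of document ids containing it.
index: dict[str, set[int]] = {}
docs = {1: "def retry_with_backoff(fn):", 2: "class CircuitBreaker:"}
for doc_id, content in docs.items():
    for gram in trigrams(content):
        index.setdefault(gram, set()).add(doc_id)

# A literal query becomes an intersection of trigram posting lists;
# candidate documents are then verified against the real pattern or regex.
query = "retry_with"
candidates = set.intersection(*(index.get(g, set()) for g in trigrams(query)))
print(candidates)  # {1}
```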

This article about LLMs is by a regularly featured author here in the newsletter. It’s not, strictly speaking, directly SRE-related, but I really got a lot out of it, so I’m including it anyway.

  Lorin Hochstein

This one explains the difference between a soft and hard dependency, why it matters, and how to use this information to improve reliability. I like the section on soft dependencies evolving into hard dependencies when you’re not looking.

  Teiva Harsanyi — The Coder Cafe
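
The distinction shows up clearly in code: a soft dependency has a fallback path, a hard one doesn't. A minimal sketch with made-up service names:

```python
def get_recommendations(user_id, recs_client, catalog_client):
    """The recommendations service is a *soft* dependency: if it's down,
    we degrade to a static list. The catalog is a *hard* dependency:
    without it there is nothing to render, so the error propagates."""
    try:
        items = recs_client.top_picks(user_id, timeout=0.2)
    except Exception:
        items = ["bestseller-1", "bestseller-2"]  # degraded but functional
    return catalog_client.hydrate(items)          # no fallback: failure is failure
```

The evolution the article warns about lives in that except branch: if the fallback quietly stops being good enough, the soft dependency has hardened without anyone deciding it should.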

In this post, we’ll walk through how we’re splitting apart our shared database into independently owned instances. We’ll explain how we defined the right boundaries, minimized risk during migrations, and built the tooling to make the process safe and scalable.

  Fabiana Scala and Tali Gutman — Datadog

At some point, the external dependencies which our systems rely on become so tightly coupled, large, and fundamental that, should those foundations inevitably fail, blame can actually go down in response to an incident.

This thought-provoking article explores why we’re more tolerant of outages from large tech companies like Google Cloud or Salesforce, and what this means for how we think about reliability engineering and incident response.

  Will Gallego

This practical guide shows how to use AWS Fault Injection Service (FIS) to perform chaos engineering experiments on self-managed Cassandra clusters. It walks through setting up experiments to test node failure scenarios and validate that applications can properly handle database outages through connection pooling and retry mechanisms.

  Hans Nesbitt and Lwanga Phillip — AWS
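
The application-side behaviour those experiments validate is usually some flavour of retry with backoff around the database client. A generic sketch of that idea, not tied to the article's FIS templates or to a specific Cassandra driver:

```python
import random
import time

def with_retries(operation, attempts=5, base_delay=0.1, max_delay=2.0):
    """Retry a flaky operation with exponential backoff and jitter,
    the kind of client behaviour a node-failure experiment should exercise."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))

# Example: with_retries(lambda: session.execute(query)), where session is a
# hypothetical driver session that surfaces node failures as ConnectionError.
```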

Klaviyo shares how they built an automated recovery system to handle billing usage tracking failures. The system uses S3 for data storage and SQS for message queuing to ensure that missed usage events are automatically recovered, eliminating manual intervention and reducing customer confusion.

  Kaavya Antony — Klaviyo
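
The S3-plus-SQS shape is a common one for replay pipelines. Here's a rough consumer-side sketch with boto3; the queue URL, message format, and record_usage hook are assumptions for illustration, not Klaviyo's actual system:

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/missed-usage-events"  # hypothetical

def drain_missed_usage_events(record_usage):
    """Pull pointers to missed usage events from SQS, fetch the payloads
    from S3, and replay them through normal billing ingestion."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            pointer = json.loads(msg["Body"])  # e.g. {"bucket": ..., "key": ...}
            obj = s3.get_object(Bucket=pointer["bucket"], Key=pointer["key"])
            for event in json.loads(obj["Body"].read()):
                record_usage(event)            # replay must be idempotent
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```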

Final stretch! We’ve handled people and processes, now let’s crack the code side and stitch everything together into a four-stage framework you can reuse.

In case you missed them: parts one and two are featured in issues #482 and #483, below.

  Konstantin Rohleder — HelloFresh

SRE Weekly Issue #483

A message from our sponsor, PagerDuty:

When the internet faltered on June 12th, other incident management platforms may have crashed—but PagerDuty handled a 172% surge in incidents and 433% spike in notifications flawlessly. Your platform should be rock-solid during a storm, not another worry.

See what sets PagerDuty’s reliability apart.

If you focus too narrowly on preventing the specific details of the last incident, you’ll fail to identify the more general patterns that will enable your future incidents.

  Lorin Hochstein

An interesting thought: scaffolding our software systems to make them more robust might actually hamper our sociotechnical system’s overall resilience. I love the horticultural analogy.

  Stuart Rimell — Uptime Labs

As LLM services become more prevalent, traditional infrastructure metrics like availability and latency are no longer sufficient on their own to measure reliability. What should we use instead?

  T-sato — Mercari

Here’s a primer on chaos testing in Kubernetes, including a tutorial on using CNCF’s LitmusChaos tool to perform chaos experiments in your cluster. It’s more than just a tutorial, because it covers theoretical topics like chaos testing anti-patterns.

   Josephine Eskaline Joyce — DZone

The problem space seems simple, but the theme here is scale: simple solutions just don’t work in an infrastructure the size of Datadog’s.

  Gabriel Reid — Datadog

This second installment focuses on operational complexity and strategic decision-making for large-scale initiatives. The article covers when to use formal programs versus working groups, how to leverage prioritization to reduce operational burden, and strategies for phased rollouts that balance technical complexity with agility.

  Konstantin Rohleder — HelloFresh

This article challenges the assumption that popular DevOps practices are universally beneficial, arguing that teams should evaluate whether practices like Kubernetes, SLOs, or GitOps actually solve their specific problems rather than adopting them because “everyone else does.”

  Tom Elliott — The Friday Deploy

This short post covers:
* Why does this distinction matter?
* An illustration to build a memorable base
* Quotes from Google’s books

  Alex Ewerlöf

SRE Weekly Issue #482

A message from our sponsor, PagerDuty:

Incidents move fast. But you’ll never get left behind with PagerDuty’s GenAI incident response assistant, available in all paid plans. Get instant business impact analysis, troubleshooting steps, and auto-drafted status updates—directly in Slack. Stop context-switching, start resolving faster.

https://fnf.dev/4dZ5V36

Salesforce posted an analysis of their major outage on June 10. An automated update restarted networking, and routing rules ended up in a bad state. This is remarkably similar to Datadog’s incident in March of 2023.

  Salesforce

In this article, the author likens LLMs to magic, in that they’re a black box in some ways. That has implications for how we go about building reliable systems around them.

  Lorin Hochstein

An executive learns a valuable lesson about the ways they can be useful during an incident — and ways they might inadvertently cause disruption.

  Hamed Silatani — Uptime Labs

This article is a summary of a new paper on how to figure out if your system is susceptible to metastable failure modes.

  Murat Demirbas

This article explores how modern teams can effectively implement, track, and leverage CFR [Change Failure Rate] to drive continuous improvement in their delivery pipelines.

   Saumen Biswas — DZone
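
The metric itself is just a ratio; the article is about how to track and act on it. For reference (numbers below are made up):

```python
def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """CFR = changes that caused a production failure / total changes shipped."""
    return failed_changes / total_changes

print(f"{change_failure_rate(3, 40):.1%}")  # 7.5%
```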

A primer on the theory and practice of circuit breakers, including example code using Resilience4j.

   Narendra Lakshmana gowda — DZone
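
The article's examples use Resilience4j in Java, but the core state machine is small enough to sketch directly (thresholds here are illustrative, and Resilience4j layers sliding windows, metrics, and configuration on top of this):

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open circuit breaker sketch."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failures = 0                  # success closes the circuit again
        return result
```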

Airbnb introduces their internal load testing framework, Impulse, and shares details about how they perform load testing at scale.

  Chenhao Yang — Airbnb

In this first of a three-part series, HelloFresh introduces their effort to manage complexity. They start by showing what they stand to gain and then introduce high-level strategies.

  Konstantin Rohleder — HelloFresh

A production of Tinker Tinker Tinker, LLC