SRE Weekly Issue #504

Salt is Cloudflare’s configuration management tool.

How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?

The result of this has been a reduction in the duration of software release delays, and an overall reduction in toilsome, repetitive triage for SRE.

  Opeyemi Onikute, Menno Bezema, and Nick Rhodes — Cloudflare

In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the number of transient deployment failures at Netflix from 4% to 0.0001%.

  Jacob Meyers and Rob Zienert — Netflix

DrP provides an SDK that teams can use to define “analyzers” to perform investigations, plus post-processors to perform mitigations, notifications, and more.

  Shubham Somani, Vanish Talwar, Madhura Parikh, Chinmay Gandhi — Meta

This article goes into detail on the ways QA folks can reskill and map their responsibilities and skills to SRE practices.

   Nidhi Sharma — DZone

“Correction of Error” is the name used by Amazon for their incident review process, and there’s a lot to unpack there.

  Lorin Hochstein

In 2019, Charity Majors came down hard on deploy freezes with an article, Friday Deploy Freezes are Exactly Like Murdering Puppies.

This one takes a more moderate approach: maybe a deploy freeze is the right choice for your organization, but you should work to understand why rather than assuming.

  Charity Majors

A piece defining the term “resilience”, with an especially interesting discussion of the inherent trade-off between efficiency and resiliency.

  Uwe Friedrichsen

Honeycomb experienced a major, extended incident in December, and they published this (extensive!) interim report. Resolution required multiple days’ worth of engineering on new functionality and procedures related to Kafka. A theme of managing employees’ energy and resources is threaded throughout the report.

  Honeycomb

SRE Weekly Issue #503

Abstraction is meant to encapsulate complexity, but when done poorly, it creates opacity—a lack of visibility into what’s actually happening under the hood.

  RoseSecurity

This article uses publicly available incident data and an open source tool to show that MTTR is not under statistical control, making it a useless metric.

  Lorin Hochstein
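
For context, a common way to check whether a metric is under statistical control is an XmR (individuals and moving range) chart. Here’s a minimal sketch in Python (my own illustration with made-up numbers, not the open source tool or data the article uses):

```python
def xmr_limits(values):
    """Natural process limits for an XmR (individuals) chart.

    Points outside these limits suggest the process is not under
    statistical control. 2.66 is the standard XmR scaling constant
    (3 / d2, with d2 = 1.128 for moving ranges of size 2).
    """
    mean = sum(values) / len(values)
    moving_ranges = [abs(a - b) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    return mean - 2.66 * avg_mr, mean + 2.66 * avg_mr

# Hypothetical incident durations in minutes; the article uses real,
# publicly available incident data instead.
mttr_samples = [42, 55, 38, 47, 60, 41, 52, 480, 45, 58]
low, high = xmr_limits(mttr_samples)
print([v for v in mttr_samples if v < low or v > high])  # -> [480]
```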

Why should we trust an AI SRE Agent? This article describes a kind of agent that shows its sources and provides more detail when asked.

Presumably these folks are saying their agent meets this description, but this isn’t (directly) a marketing piece (except for the last 2 sentences).

  RunLLM

The idea here is targeted load shedding, terminating tasks that are the likely cause of overload, using efficient heuristics.

  Murat Demirbas — summary

  Yigong Hu, Zeyin Zhang, Yicheng Liu, Yile Gu, Shuangyu Lei, and Baris Kasikci — original paper
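
The paper’s heuristics are more sophisticated than this, but to illustrate the basic shape of targeted shedding (as opposed to dropping work indiscriminately), here’s a toy Python sketch that cancels the highest-cost in-flight tasks first:

```python
def shed_targeted(tasks, load, capacity):
    """Pick victims for load shedding, most expensive tasks first.

    tasks: list of (task_id, estimated_cost) pairs for in-flight work.
    Returns the task IDs to terminate so that load fits within capacity.
    """
    victims = []
    for task_id, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        if load <= capacity:
            break
        victims.append(task_id)
        load -= cost
    return victims

# Hypothetical in-flight tasks with rough per-task cost estimates.
print(shed_targeted([("a", 5), ("b", 40), ("c", 8), ("d", 30)], load=90, capacity=50))
# -> ['b']: terminating the likely cause of overload spares the small tasks
```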

Part 2 is just as good as the first, and I highly recommend reading it — along with the original Ironies of Automation paper.

  Uwe Friedrichsen

Take a deep technical dive into GitLab.com’s deployment pipeline, including progressive rollouts, canary strategies, database migrations, and multiversion compatibility.

  John Skarbek — GitLab

A fun debugging story with an unexpected resolution, plus a discussion of broader lessons learned.

  Liam Mackie — Octopus Deploy

A review of AWS’s talk on their incident, with info about what new detail AWS shared and some key insights from the author.

  Lorin Hochstein

Cloudflare discusses what they’re doing in response to their recent high-profile outages. They’re moving toward applying more structure and rigor to configuration deployments, like they already have for code deployments.

  Dane Knecht — Cloudflare

SRE Weekly Issue #502

Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.

  Harris Hancock — Cloudflare
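
Not Cloudflare’s actual implementation, but as a refresher on the core technique, here’s a minimal consistent-hashing sketch in Python: the same script name keeps landing on the same server even as servers come and go, which is what keeps requests hitting warm instances.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each server gets many points on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key):
        # First point clockwise from the key's hash; adding or removing a
        # server only remaps a small fraction of keys.
        idx = bisect_right(self.points, self._hash(key)) % len(self.points)
        return self.ring[idx][1]

# Hypothetical usage: route a Worker script to a stable server.
ring = HashRing([f"server-{n}" for n in range(20)])
print(ring.lookup("worker-script-abc"))
```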

I appreciate the way this article also shares how each of logs, metrics, traces, and alerts has its downsides, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.

  Mahmoud Abdelwahab — Railway

I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.

  John Allspaw — Adaptive Capacity Labs

IaC may bring more trouble than it solves, and it may simply move or hide complexity, according to this article.

  RoseSecurity

[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.

  Fred Hebert — summary

  Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper

This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.

  Marc Brooker

A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.

  Ar Hakboian — OpsWorker.ai

I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.

  Lorin Hochstein

SRE Weekly Issue #501

A message from our sponsor, Depot:

“Waiting for a runner” but the runner is online? Depot debugs three cases where symptoms misled engineers. Workflow permissions, Azure authentication, and Dependabot’s security context all caused failures that looked like infrastructure problems.

A thoughtful evaluation of current trends in AI through the lens of Lisanne Bainbridge’s classic paper, The Ironies of Automation. I really got a lot out of this one.

  Uwe Friedrichsen

They supercharged the workflow engine by rewriting it. I like the way they explained why they settled on a full rewrite and the alternative options they considered.

  Jun He, Yingyi Zhang, and Ely Spears — Netflix

This one goes deep on how to build a reliable service on unreliable parts. Can retries improve your overall reliability? What about the reliability of the retry system itself?

  Warren Parad — Authress
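
As a baseline for the retry discussion, here’s a sketch (mine, not from the article) of capped exponential backoff with full jitter, the usual starting point before you get into the reliability of the retry system itself:

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter.

    Jitter spreads retries out so a dependency recovering from an outage
    isn't flattened by a synchronized retry storm.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))

# Hypothetical usage, wrapping some unreliable dependency call:
# result = call_with_retries(lambda: flaky_api_client.get("/health"))
```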

In this article, we’ll explore how cold-restart dependencies form, why typical recovery designs break down, and what architectural principles can help systems warm up faster after a complete outage.

  Bala Kambala

This one goes into the qualities of a good post-incident review, the definition of resilience, and a discussion of blamelessness, drawing lessons from aviation.

  Gamunu Balagalla — Uptime Labs

It would be easy to blame the poor outcome of BOAC 712’s engine failure on human error, since the pilots missed key steps in their checklists. Instead, the investigators cited systemic issues, resulting in improvements in checklists and other areas.

  Mentour Pilot

Cloudflare had another significant outage, though not as big as the one last month. This one was related to steps they took to mitigate the big React RCE vulnerability.

  Dane Knecht — Cloudflare

Lorin’s whole analysis is awesome, but there’s an especially incisive section at the end that uses math to put Cloudflare’s run of 2 recent big incidents in perspective.

  Lorin Hochstein

SRE Weekly Issue #500

A message from our sponsor, Depot:

Stop hunting through GitHub Actions logs. Depot now offers powerful CI log search across all your repositories and workflows. With smart filtering by timeframe, runner type, and keywords, you’ll have all the information at your fingertips to debug faster.

Wow, five hundred issues! I sent the first issue of SRE Weekly out almost exactly ten years ago. I assumed my little experiment would fairly quickly come to an end when I exhausted the supply of SRE-related articles.

I needn’t have worried. Somehow, the authors I’ve featured here have continued to produce a seemingly endless stream of excellent articles. If anything, the pace has only picked up over time! A profound thank you to all of the authors, without whom this newsletter would be just an empty bulleted list.

And thanks to you, dear readers, for making this worthwhile. Thanks for sharing the articles you find or write, I love receiving them! Thanks for the notes you send after an issue you particularly like, and the corrections too. Thanks for your kind well-wishes for my recent surgery, they meant a ton.

Finally, thanks to my sponsors, whose support makes all this possible. If you see something interesting, please give it a click and check it out!

When a scale-up event actually causes increased resource usage for a while, a standard auto-scaling algorithm can fail.

   Minh Nhat Nguyen, Shi Kai Ng, and Calvin Tran — Grab
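
To make the failure mode concrete: if scaling up itself temporarily raises resource usage (cache warm-up, rebalancing), a naive controller reads that as “still overloaded” and keeps scaling. A common guard, sketched below in Python (hypothetical, not Grab’s algorithm), is to suppress further scale-ups during a warm-up window:

```python
import time

def decide_replicas(cpu_utilization, target, replicas,
                    last_scale_up_ts, warmup_seconds=300, now=None):
    """Scale-up decision with a warm-up guard.

    Reacting to utilization while a previous scale-up is still warming up
    (and temporarily inflating resource usage) causes runaway scaling, so
    further scale-ups are suppressed until the warm-up window has passed.
    """
    now = time.time() if now is None else now
    warming_up = now - last_scale_up_ts < warmup_seconds
    if cpu_utilization > target and not warming_up:
        return replicas + 1
    return replicas

# 80% CPU two minutes after the last scale-up: still in the warm-up
# window, so hold at 4 replicas instead of piling on.
print(decide_replicas(0.8, target=0.6, replicas=4, last_scale_up_ts=time.time() - 120))
```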

A database schema change added an index on a large table without using the CONCURRENTLY option, locking the table. This reminds me of a similar incident, and its solution, from when I worked at Honeycomb.

  Ray Chen — Railway
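
For reference, the safer variant looks roughly like this (a sketch with a hypothetical table name, using psycopg2, not Railway’s actual migration). CREATE INDEX CONCURRENTLY builds the index without blocking writes to the table, though it can’t run inside a transaction block:

```python
import psycopg2

# Hypothetical connection string; adjust for your environment.
conn = psycopg2.connect("dbname=app user=app")

# CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
# so autocommit has to be enabled first.
conn.autocommit = True

with conn.cursor() as cur:
    # Builds the index without taking the lock that blocks writes for
    # the duration of the build (at the cost of a slower build).
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_created_at "
        "ON events (created_at)"
    )

conn.close()
```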

Oof, that’s a harsh title, but this is a great discussion of how we strive to design for reliability even when our downstream vendors have outages.

  Uwe Friedrichsen

This one has a lot of good recommendations for staff-level SREs covering 8 areas, shared by a former Staff SRE.

  Karan Nagarajagowda

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses prolonged by synchronous log writes to a busy disk.

   Nataraj Mocherla — DZone

This air accident report video by Mentour Pilot has a great example of alert fatigue around 30 minutes in. The air traffic controllers received enough spurious conflict alerts every day that they became easy to ignore.

  Mentour Pilot

In this post you learn:
* What are emergent properties and what kind of system has them?
* What are weak and strong emergence as opposed to resultant properties?
* How do emergent properties impact the reliability, maintainability, predictability, and cost of the system?

Well worth a read. It really got me thinking about emergence and its relationship to reliability.

  Alex Ewerlöf

In an incident, it’s important to have someone be in charge — and for it to be clear who that is, as explained in this article.

  Joe Mckevitt — Uptime Labs

A production of Tinker Tinker Tinker, LLC