SRE Weekly Issue #489

A message from our sponsor, Observe, Inc.:

Observe’s free Masterclass in Observability at Scale is coming on September 4th at 10am Pacific! We’ll explore how to architect for observability at scale – from streaming telemetry and open data lakes to AI agents that proactively instrument your code and surface insights.

Learn more and register today!

As we learn advanced resilience engineering concepts, this article recommends taking a balanced approach to changing existing practices.

I can confidently say that when an executive leader wants to be talking about quality of service for your customers, the last thing they want to hear about is academic papers and Monte Carlo simulations.

  Michelle Casey — Resilience in Software Foundation

I know you probably know all about how hashing works, but this one’s still worth a read. The article includes interactive demonstrations and clearly presents concepts to help you understand how hashing function performance is evaluated.

  Sam Rose
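
If you want to poke at the idea yourself, here's a minimal sketch (mine, not from the article) of one common way hash function quality gets evaluated: hash a pile of random keys into buckets and compare the distribution against uniform with a chi-squared statistic.

```python
import hashlib
import random
import string

def bucket_counts(hash_fn, keys, n_buckets):
    """Count how many keys land in each bucket under hash_fn."""
    counts = [0] * n_buckets
    for key in keys:
        counts[hash_fn(key) % n_buckets] += 1
    return counts

def chi_squared(counts, n_keys):
    """Compare the observed distribution against perfectly uniform.
    For a well-behaved hash, this lands near len(counts) - 1."""
    expected = n_keys / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def md5_hash(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = {"".join(random.choices(string.ascii_lowercase, k=12)) for _ in range(100_000)}
counts = bucket_counts(md5_hash, keys, 1024)
print(f"chi-squared: {chi_squared(counts, len(keys)):.1f} (ideal is near 1023)")
```

A statistic far above the bucket count suggests the function is clumping keys; the article's interactive demos make the same point visually.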

Pulled from the Internet Archive, here’s a story of how the now-defunct Parse rewrote their Ruby on Rails API in Golang, significantly improving reliability.

  Charity Majors

We are sharing methodologies we deploy at various scales for detecting SDC [Silent Data Corruption] across our AI and non-AI infrastructure to help ensure the reliability of AI training and inference workloads across Meta.

  Harish Dattatraya Dixit and Sriram Sankar — Meta

As monday.com broke their monolith up into microservices, their number of databases expanded too. To have a chance of managing all of them, they shifted from traditional DBA (database administration) practices to DBRE (database reliability engineering).

  Mateusz Wojciechowski — monday.com

Airbnb runs a large-scale database on Kubernetes. They have various techniques to deal with the ephemerality of pods and the risks inherent in cluster upgrades.

  Artem Danilov — Airbnb

The author of this article brings us along as they do a very thorough evaluation of K8sGPT, showing us what it can do and some ways in which it can fall short.

  Evgeny Torin — Palark

What is good incident communication? This article draws on theory from Herbert Clark’s Joint Action Ladder to help us evaluate and strengthen communication.

  Stuart Rimell — Uptime Labs

SRE Weekly Issue #488

A message from our sponsor, Observe, Inc.:

Observe’s free Masterclass in Observability at Scale is coming on September 4th at 10am Pacific! We’ll explore how to architect for observability at scale – from streaming telemetry and open data lakes to AI agents that proactively instrument your code and surface insights.

Learn more and register today!

A story of the failure of a pumped energy storage facility, involving all of our favorite features like complex contributing factors, work-as-done vs work-as-designed, and early warning signs only obvious in hindsight. As a bonus, no one was killed.

  Practical Engineering

Nebula’s streaming service has a surprisingly write-heavy workload, owing to storing a bookmark of the latest point each user has watched in a video. That makes scaling an interesting challenge.

   Sam Rose — Nebula
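
Nebula's actual architecture is in the article; purely to show why bookmark-style writes are tamable, here's a sketch of a last-write-wins coalescing buffer. All the names here (BookmarkBuffer, flush) are mine, not Nebula's.

```python
import threading

class BookmarkBuffer:
    """Coalesce frequent playback-position updates: keep only the latest
    position per (user, video) and flush in batches, turning a stream of
    tiny writes into one periodic bulk write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}  # (user_id, video_id) -> position_seconds

    def record(self, user_id, video_id, position_seconds):
        with self._lock:
            self._latest[(user_id, video_id)] = position_seconds

    def flush(self, write_batch):
        """write_batch persists a whole dict of bookmarks in one write."""
        with self._lock:
            batch, self._latest = self._latest, {}
        if batch:
            write_batch(batch)

buf = BookmarkBuffer()
for pos in range(0, 60, 5):  # one viewer scrubbing through a video
    buf.record("user1", "vid9", pos)
buf.flush(lambda b: print(f"persisting {len(b)} row(s): {b}"))
# persists one row instead of twelve separate writes
```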

I love the debugging technique they used: kill processes one at a time until performance improves.

  Samson Hu, Shashank Tavildar, Eric Kalkanger, and Hunter Gatewood — Pinterest

This article is about finding the balance between having enough process to ensure incident response goes smoothly, and having so much process that incident responders are unable to adapt to unexpected situations.

  Brandon Chalk — Rootly

This article presents two case studies of dialog during incidents along with analysis of each. How does your own analysis compare?

  Hamed Silatani — Uptime Labs

They realized that a single alert can’t catch both a sudden air conditioning failure and an AC unit that becomes slowly but steadily overwhelmed.

  Chris Siebenmann
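
Here's a toy illustration (mine, not Chris's code) of why those really are two different alerts: an absolute threshold catches the sudden failure, while a trend check catches the slow decline that never crosses the threshold in time.

```python
def sudden_failure(temps_c, limit=30.0):
    """Absolute threshold: fires on an abrupt spike past the limit."""
    return temps_c[-1] > limit

def gradual_failure(temps_c, window=6, max_slope=0.5):
    """Trend check: fires when temperature climbs steadily, even while
    every individual reading is still below the absolute limit."""
    recent = temps_c[-window:]
    if len(recent) < window:
        return False
    slope = (recent[-1] - recent[0]) / (window - 1)  # degrees per interval
    return slope > max_slope

readings = [22.0, 23.1, 24.2, 25.0, 26.1, 27.2]  # creeping upward
print(sudden_failure(readings))   # False: never crossed 30 degrees
print(gradual_failure(readings))  # True: roughly 1 degree per interval
```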

Thoughts on migrations as a significant source of reliability risk.

[…] engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores.

  Lorin Hochstein

An incorrect physical disconnection was made to the active network switch serving our control plane, rather than the redundant unit scheduled for removal.

This reminds me of wrong-site surgery incidents and aircraft pilots shutting off the good engine when one fails.

  Google

SRE Weekly Issue #487

A message from our sponsor, Spacelift:

IaC Experts! IaCConf Call for Presenters – August 27, 2025
The upcoming IaCConf Spotlight dives into the security and governance challenges of managing infrastructure as code at scale. From embedding security in your pipelines to navigating the realities of open source risk, this event brings together practitioners who are taking a security-minded approach to how they implement IaC in their organization.

Call for Presenters is now open until Friday, August 1. Submit your CFP or register for the free event today.

Join the Free Virtual Event

Pinterest decided to replace their Hadoop+Spark-based data processing pipeline with one based on Kubernetes.

In part one, we provide rationale for our new technical direction prior to outlining the overall design and detailing the application focused layer of our platform. We conclude with current status and some of our learnings.

  Soam Acharya, Rainie Li, William Tom, and Ang Zhang — Pinterest

This article raises some important concerns about AI-assisted coding that are worth thinking about.

It’s fast and feels efficient, but it masks a drop in codebase familiarity. Over time, your top engineers stop being system experts.

  Alexander Procter — Okoone

I really love the care taken in this article to consider the potential risks of AI tools for incident response. There are many valuable insights that make this article way more than just a sales pitch for their tool.

  Chris Evans — incident.io

Quicksilver is a globally distributed key-value store serving billions of requests per second where speed is critical, so you know the scaling challenges are going to be interesting.

  Marten van de Sanden and Anton Dort-Golts — Cloudflare

This article gives reproducible cases in which MySQL and Postgres can reuse auto-increment IDs.

I think I’ve seen this advice violated at nearly every company I’ve worked at:

Best practice dictates that you shouldn’t be using IDs from database tables outside of that table unless it’s some foreign key field

  Sam Rose
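
The article's reproductions target MySQL and Postgres; as a self-contained taste of the same failure class, SQLite shows it in a few lines, since without the AUTOINCREMENT keyword a deleted maximum rowid gets reused:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO jobs (name) VALUES ('a'), ('b'), ('c')")
con.execute("DELETE FROM jobs WHERE id = 3")         # drop the max id
con.execute("INSERT INTO jobs (name) VALUES ('d')")  # id 3 gets reused
print(con.execute("SELECT id, name FROM jobs").fetchall())
# [(1, 'a'), (2, 'b'), (3, 'd')] -- any external reference to "job 3"
# now silently points at a different row
```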

Here’s a great explanation of why it’s often better to use for_each instead of count in Terraform.

  Ned Bellavance
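
The core of the argument is that count gives resources positional addresses, so deleting one item renumbers everything after it. Here's a toy Python simulation of that diff behavior; it's not Terraform's planner, just the shape of the problem.

```python
def plan(old, new):
    """Toy diff in the spirit of a Terraform plan: an address that
    vanished is destroyed, a new address is created, and an address
    whose underlying item changed is replaced (destroy + create)."""
    return {
        "destroy": [a for a in old if a not in new],
        "create": [a for a in new if a not in old],
        "replace": [a for a in old if a in new and old[a] != new[a]],
    }

before = ["alpha", "beta", "gamma"]
after = ["alpha", "gamma"]  # remove the middle server

# count: positional addresses like web[0], web[1], ...
count_plan = plan(
    {f"web[{i}]": n for i, n in enumerate(before)},
    {f"web[{i}]": n for i, n in enumerate(after)},
)
# for_each: stable keyed addresses like web["alpha"]
each_plan = plan(
    {f'web["{n}"]': n for n in before},
    {f'web["{n}"]': n for n in after},
)

print(count_plan)  # {'destroy': ['web[2]'], 'create': [], 'replace': ['web[1]']}
print(each_plan)   # {'destroy': ['web["beta"]'], 'create': [], 'replace': []}
```

With count, removing one server touches two resources; with for_each, only the removed server is destroyed.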

This debugging story really drew me in. It’s so incredibly satisfying the way their initial theory was confirmed so tidily in the end.

  Nayef Ghattas — Datadog

In our latest Rootly roundtable, we sat down with a group of seasoned SREs (collectively packing over 100 years of ops scars) to trade notes on what makes an alert useful, what makes it noise, and how to build alerting systems that teams can trust.

Here are their top strategies distilled for you:

  Jorge Lainfiesta — Rootly

SRE Weekly Issue #486

A message from our sponsor, Spacelift:

IaC Experts! IaCConf Call for Presenters – August 27, 2025

The upcoming IaCConf Spotlight dives into the security and governance challenges of managing infrastructure as code at scale. From embedding security in your pipelines to navigating the realities of open source risk, this event brings together practitioners who are taking a security-minded approach to how they implement IaC in their organization.

Call for Presenters is now open until Friday, August 1. Submit your CFP or register for the free event today.

https://events.iacconf.com/iac-security-spotlight-august-2025/?utm_medium=email&utm_source=sreweekly

For his hundredth(!) episode of Slight Reliability, Stephen Townshend has an awesome chat with John Allspaw. I especially loved the part where John pointed out that different people will get different “Aha Moments” from the same incident.

  Stephen Townshend

This article delves deep into the nuances of Recovery Time Objective and Recovery Point Objective and how to manage both without spending too much. There’s a strong theme of using feature flags, as you might expect from this company, but the article goes beyond being just a one-dimensional product pitch.

  Jesse Sumrak — LaunchDarkly
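
As a quick refresher on the arithmetic underneath those terms: RPO is bounded by how often you capture state, and RTO is bounded by how long detection plus restore plus verification takes. A toy calculation:

```python
def worst_case_data_loss_min(snapshot_interval_min):
    """RPO bound: with periodic snapshots, a failure just before the
    next snapshot loses up to one full interval of writes."""
    return snapshot_interval_min

def worst_case_downtime_min(detect_min, restore_min, verify_min):
    """RTO bound: recovery can't beat detection + restore + verification."""
    return detect_min + restore_min + verify_min

print(worst_case_data_loss_min(15))        # 15 minutes of writes at risk
print(worst_case_downtime_min(5, 30, 10))  # 45 minute floor on recovery
```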

A discussion of the qualities of a good alert and how to audit and improve your alerting.

  Hannah Roy — Tines

This one contrasts two views on latent defects in our systems, from Root Cause Analysis and Resilience Engineering perspectives. The RE perspective looks scary, but it’s much more nuanced than that.

  Lorin Hochstein

Grab has seen multiple scenarios in which concurrent cache writes result in inconsistent fares. This article explains their strategies for detecting and dealing with them.

   Ravi Teja Thutari — DZone
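
Grab's specific detection and resolution strategies are in the article; as a generic illustration of one standard defense against lost updates, here's a compare-and-set sketch where a stale writer loses instead of clobbering newer data (all names are hypothetical):

```python
import threading

class VersionedCache:
    """Optimistic concurrency for cache writes: every key carries a
    version, and a write only lands if the caller read the version it
    is replacing. Stale writers lose instead of clobbering newer data."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # someone wrote first; re-read and retry
            self._data[key] = (current_version + 1, value)
            return True

cache = VersionedCache()
version, _ = cache.get("fare:ride42")
cache.compare_and_set("fare:ride42", version, 12.50)         # first writer wins
print(cache.compare_and_set("fare:ride42", version, 11.00))  # False: stale write rejected
```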

Adding a node to a CouchDB cluster went poorly, resulting in lost data in this incident from 2024.

The mistake we made in our automated process for adding nodes was to add the new node to our load balancer before it had fully synchronised.

  Sam Rose — Budibase
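
The generic fix shape here is a readiness gate: don't hand a node to the load balancer until it reports a finished sync. A sketch, with check_sync standing in for whatever replication-status check your datastore exposes (this is my illustration, not Budibase's actual remediation):

```python
import time

def wait_until_synced(node, check_sync, timeout_s=3600, poll_s=10):
    """Poll until the new node reports a completed sync, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_sync(node):
            return True
        time.sleep(poll_s)
    return False

def add_node(node, check_sync, register_with_lb):
    """Only register the node once it can serve complete data."""
    if not wait_until_synced(node, check_sync):
        raise RuntimeError(f"{node} never finished syncing; not registering")
    register_with_lb(node)
```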

The parallels between this incident and the Budibase one above are striking! I swear it’s a coincidence that I came across both of these old incident reports in the same week.

  Chris Evans and Suhail Patel — Monzo

Another tricky failure mode for Cloudflare’s massive DNS resolver service. They share all the details in this post with their usual flare (sorry, I couldn’t resist).

  Ash Pallarito and Joe Abley — Cloudflare

SRE Weekly Issue #485

YOUR AD COULD BE HERE!

SRE Weekly has openings for new sponsorships. Reply or email lex at sreweekly.com for details.

How would you migrate several million databases, with minimal impact to your users?

Atlassian allocates one Postgres database per tenant customer, with a few thousand colocated on each RDS instance. This migration story was a riveting read!

  Pat Rubis — Atlassian

Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong.

  Lorin Hochstein

My favorite part of this article was the explanation of how they handle pent-up logs when a customer’s endpoint recovers, without overwhelming the endpoint.

  Gabriel Reid — Datadog
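
For flavor, here's a minimal pacing sketch (mine, not Datadog's implementation) of the underlying idea: drain the backlog in capped batches so catch-up traffic can't knock the freshly recovered endpoint back over.

```python
import time

def drain_backlog(backlog, send, max_per_sec=100, batch_size=20):
    """Drain queued log lines at a capped rate after an endpoint
    recovers, so catch-up traffic can't overwhelm it again."""
    interval_s = batch_size / max_per_sec  # pause between batches
    while backlog:
        batch, backlog = backlog[:batch_size], backlog[batch_size:]
        send(batch)
        if backlog:
            time.sleep(interval_s)

logs = [f"line {i}" for i in range(55)]
drain_backlog(logs, lambda batch: print(f"sent {len(batch)} lines"))
# sent 20 lines / sent 20 lines / sent 15 lines, paced 0.2s apart
```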

How do you deal with fundamental surprise? This article introduces the concept of surprise², an incident you couldn’t see coming. Click through for some strategies to handle the inevitable occasional fundamentally surprising incident.

  Stuart Rimell — Uptime Labs

A team found themselves needing to switch to microservices, and they chronicled their approach and results. I really like the section on the surprises they encountered.

   Shushyam Malige Sharanappa — DZone

Dropbox shares what went into the rollout of their new fleet, including careful management of heat, vibration, and power.

  Eric Shobe and Jared Mednick — Dropbox

In this blog post, we’ll dive into the details of three mighty alerts that play their unique role in supporting our production infrastructure, and explore how they’ve helped us maintain the high level of performance and uptime that our community relies on.

…plus one bonus alert!

  Jeremy Udit — Hugging Face

Klaviyo adopted RDS’s blue/green deployment feature to make MySQL version upgrades much less painful. In this article they share their path to blue/green deployment and their results.

  Marc Dellavolpe — Klaviyo
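
For anyone new to the pattern, here's the shape of a blue/green database upgrade as a hedged sketch; the callables are placeholders of my own, not RDS's actual API.

```python
import time

def blue_green_upgrade(blue, make_green, replication_lag_s, switch_traffic):
    """Generic blue/green database upgrade: build an upgraded copy
    (green) alongside production (blue), let replication catch up,
    then swap traffic in one short cutover instead of a long outage."""
    green = make_green(blue)                   # upgraded, replicating copy
    while replication_lag_s(blue, green) > 0:  # wait for green to catch up
        time.sleep(1)
    switch_traffic(blue, green)                # the only brief interruption
    return green
```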
