SRE Weekly Issue #500

A message from our sponsor, Depot:

Stop hunting through GitHub Actions logs. Depot now offers powerful CI log search across all your repositories and workflows. With smart filtering by timeframe, runner type, and keywords, you’ll have all the information at your fingertips to debug faster.

Wow, five hundred issues! I sent the first issue of SRE Weekly out almost exactly ten years ago. I assumed my little experiment would fairly quickly come to an end when I exhausted the supply of SRE-related articles.

I needn’t have worried. Somehow, the authors I’ve featured here have continued to produce a seemingly endless stream of excellent articles. If anything, the pace has only picked up over time! A profound thank you to all of the authors, without whom this newsletter would be just an empty bulleted list.

And thanks to you, dear readers, for making this worthwhile. Thanks for sharing the articles you find or write, I love receiving them! Thanks for the notes you send after an issue you particularly like, and the corrections too. Thanks for your kind well-wishes for my recent surgery, they meant a ton.

Finally, thanks to my sponsors, whose support makes all this possible. If you see something interesting, please give it a click and check it out!

When a scale-up event actually causes increased resource usage for a while, a standard auto-scaling algorithm can fail.

   Minh Nhat Nguyen, Shi Kai Ng, and Calvin Tran — Grab
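
To make that failure mode concrete (my own sketch, not the article's code): a threshold-based autoscaler that treats the temporary post-scale-up spike as real load will keep adding capacity. A warm-up cooldown is one simple guard; `get_avg_cpu` and `add_instances` below are hypothetical stand-ins for your metrics and orchestration APIs.

```python
import time

SCALE_UP_THRESHOLD = 0.80   # average CPU fraction that triggers a scale-up
COOLDOWN_SECONDS = 300      # ignore readings while new instances warm up

last_scale_up = 0.0

def autoscale_step(get_avg_cpu, add_instances):
    """One evaluation cycle of a threshold-based autoscaler.

    get_avg_cpu and add_instances are hypothetical callables standing in
    for your metrics and orchestration APIs.
    """
    global last_scale_up
    now = time.time()

    # A naive autoscaler skips this check. If scaling up itself raises CPU
    # for a while (startup work, cache warming, rebalancing), it reads that
    # spike as more load and scales again, compounding the problem.
    if now - last_scale_up < COOLDOWN_SECONDS:
        return

    if get_avg_cpu() > SCALE_UP_THRESHOLD:
        add_instances(1)
        last_scale_up = now
```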

A database schema change added an index on a large table without using the CONCURRENTLY option, locking the table. This reminds me of a similar incident, and its solution, from when I worked at Honeycomb.

  Ray Chen — Railway
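
For a concrete picture of the locking difference (a generic PostgreSQL sketch with a hypothetical table and connection string, not Railway's actual change): a plain CREATE INDEX blocks writes to the table for the whole build, while CREATE INDEX CONCURRENTLY builds online, and it has to run outside a transaction block.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

# CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
# so enable autocommit for this statement.
conn.autocommit = True

with conn.cursor() as cur:
    # This form takes a lock that blocks writes for the whole build:
    #   CREATE INDEX idx_orders_user_id ON orders (user_id);
    # This form builds online (slower, but writes keep flowing):
    cur.execute("CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);")

conn.close()
```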

Oof, that’s a harsh title, but this is a great discussion of how we strive to design for reliability even when our downstream vendors have outages.

  Uwe Friedrichsen

This one has a lot of good recommendations for staff-level SREs covering 8 areas, shared by a former Staff SRE.

  Karan Nagarajagowda

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses were blocked by synchronous log writes to a busy disk.

   Nataraj Mocherla — DZone

This air accident report video by Mentour Pilot has a great example of alert fatigue around 30 minutes in. The air traffic controllers received enough spurious conflict alerts every day that they became easy to ignore.

  Mentour Pilot

In this post you learn:
* What are emergent properties and what kind of system has them?
* What are weak and strong emergence as opposed to resultant properties?
* How do emergent properties impact the reliability, maintainability, predictability, and cost of the system?

Well worth a read. It really got me thinking about emergence and its relationship to reliability.

  Alex Ewerlöf

In an incident, it’s important to have someone be in charge — and for it to be clear who that is, as explained in this article.

  Joe Mckevitt — Uptime Labs

SRE Weekly Issue #499

The folks at Uptime Labs and Adaptive Capacity Labs have announced an advent calendar for this December.

Note: In order to take part, you’ll need to provide an email address to subscribe. I gave that some serious thought before including this here, but ultimately, I have a lot of trust in the folks at both ACL and Uptime Labs, since they’ve both produced so much awesome content that’s been featured here. I’m interested to see what this collab will bring!

  Uptime Labs and Adaptive Capacity Labs

Cool trick: divide short-term P95 latency by the long-term P95 to detect load spikes and adjust rate limits on the fly.

  Shravan Gaonkar — Airbnb
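
Here’s a rough sketch of the trick as I understand it (my own illustration with hypothetical sample windows, not Airbnb’s implementation): when the recent p95 rises well above the long-term baseline, shrink the rate limit proportionally.

```python
import statistics

def p95(samples):
    """95th-percentile latency (ms) from a list of samples."""
    return statistics.quantiles(samples, n=100)[94]

def adjusted_rate_limit(base_limit, recent_ms, baseline_ms):
    """Shrink the rate limit when recent p95 latency spikes above the
    long-term baseline; recent_ms and baseline_ms are hypothetical
    lists of latency samples from the two windows."""
    ratio = p95(recent_ms) / p95(baseline_ms)
    if ratio <= 1.0:
        return base_limit                     # no spike: keep the normal limit
    return max(1, int(base_limit / ratio))    # spike: throttle proportionally

# Example: recent p95 roughly double the baseline -> roughly halve the limit.
recent = [80, 90, 120, 200, 210] * 20
baseline = [40, 60, 80, 100, 110] * 20
print(adjusted_rate_limit(1000, recent, baseline))
```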

Datadog shares the bigger-picture lessons they learned and improvements they made since their major 2023 outage, including an emphasis on graceful degradation.

  Laura de Vesine, Rob Thomas, and Maciej Kowalewski

This article does a really good job of laying out the problems with serverless that led them to leave: having to layer on significant complexity to deal with the limits of running in Cloudflare workers.

  Andreas Thomas — Unkey

This article explains the two concepts of reliability and fault tolerance and how they relate.

  Oakley Hall

This one could easily be titled, “Today, major system failures meant that I was able to take down production really easily.” There’s some great discussion in the comments, and I hope the author feels better.

  u/Deep-Jellyfish-2383 and others — reddit

Slack shows how they reworked their monolithic process for deploying Chef cookbook changes, reducing risk by breaking production up into 6 separate environments.

  Archie Gunasekara — Slack

The author discusses reasons why engineer attrition won’t appear in a public incident write-up, and may well not appear in a private one, either.

  Lorin Hochstein

SRE Weekly Issue #498

A message from our sponsor, Costory:

You didn’t sign up to do FinOps. Costory automatically explains why your cloud costs change, and reports it straight to Slack. Built for SREs who want to code, not wrestle with spreadsheets. Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

Cloudflare had a major incident this week, and they say it was their worst since 2019. In this report, they explain what happened, and the failure mode is pretty interesting.

  Matthew Prince — Cloudflare

How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.

They cover not just the motivation for and improvements in V2, but also the migration process to deploy V2 without interruption.

  Shravan Gaonkar — Airbnb

Netflix’s WAL service acts as a go-between, streaming data to pluggable targets while providing extra functionality like retries, delayed sending, and a dead-letter queue.

  Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu — Netflix

A (very) deep dive into Datadog’s custom data store, with special attention to how it handles query planning and optimization.

  Sami Tabet — Datadog

Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use them to learn as much as possible about the work involved in diagnosing and remediating incidents in your company.

  Lorin Hochstein

we landed on a two-level failure capture design that combines Kafka topics with an S3 backup to ensure no event is ever lost.

  Tanya Fesenko, Collin Crowell, Dmitry Mamyrin, and Chinmay Sawaji — Klaviyo
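
A minimal sketch of what a two-level capture like that might look like (hypothetical topic, brokers, and bucket names, not Klaviyo’s code): publish to Kafka first, and if that fails, park the event in S3 for later replay.

```python
import json
import uuid

import boto3
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],          # hypothetical brokers
    value_serializer=lambda v: json.dumps(v).encode(),
)
s3 = boto3.client("s3")
BACKUP_BUCKET = "event-capture-backup"         # hypothetical bucket

def capture_event(event: dict) -> None:
    """Level one: publish to Kafka. Level two: if the publish fails,
    park the event in S3 so it can be replayed instead of dropped."""
    try:
        producer.send("events", event).get(timeout=5)  # wait for the broker ack
    except Exception:
        s3.put_object(
            Bucket=BACKUP_BUCKET,
            Key=f"failed-events/{uuid.uuid4()}.json",
            Body=json.dumps(event).encode(),
        )
```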

Buried in this one is this gem: the last layer of reliability is that their client library automatically retries to alternate regions if the main region fails.

  Paddy Byers — Ably
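
The pattern is easy to picture with a small sketch (hypothetical endpoints, not Ably’s client library): try the primary region, then retry the same request against the remaining regions before giving up.

```python
import requests

# Hypothetical per-region endpoints for the same service.
REGION_ENDPOINTS = [
    "https://us-east-1.example.com",
    "https://eu-west-1.example.com",
    "https://ap-southeast-1.example.com",
]

def publish(payload: dict) -> requests.Response:
    """Try the primary region first; on failure, retry the same request
    against the remaining regions before giving up."""
    last_error = None
    for endpoint in REGION_ENDPOINTS:
        try:
            resp = requests.post(f"{endpoint}/publish", json=payload, timeout=2)
            resp.raise_for_status()
            return resp
        except requests.RequestException as err:
            last_error = err
    raise last_error
```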

incident.io shares details on how they fared during the AWS us-east-1 incident on October 20.

  Pete Hamilton — incident.io

SRE Weekly Issue #497

A message from our sponsor, Costory:

You didn’t sign up to do FinOps.
Costory automatically explains why your cloud costs change, and reports it straight to Slack.
Built for SREs who want to code, not wrestle with spreadsheets.
Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

A thoughtful framework for evaluating the risk in using AI coding tools, centering around the probability, detectability, and impact of errors.

  Birgitta Böckeler — martinfowler.com

Cloudflare does some really fascinating things with networking. Here’s a deep dive on how they solved a problem in their implementation of sharing IP addresses across machines.

  Chris Branch — Cloudflare

I especially like how they nail down what exactly counts as “zero downtime” in the migration. They did allow some kinds of degradation.

  Anna Dowling — Tines

We’re always making tradeoffs in our systems (and companies). Incidents can help us see whether we’re making the right ones and how our decisions have played out.

  Fred Hebert

Fixation on a plan, on a model of the system, or on a theory of the cause, is a major risk in incident response.

  Lorin Hochstein

how do you design a system with events that have different SLO requirements?

They added a proxy layer on the consumer side to allow parallel processing within partitions, to avoid head-of-line blocking.

  Rohit Pathak, Tanya Fesenko, Collin Crowell, and Dmitry Mamyrin — Klaviyo
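
A minimal sketch of the idea (my own illustration, not Klaviyo’s proxy): fan a partition’s records out to a small worker pool so one slow event doesn’t stall everything queued behind it, accepting that strict in-partition ordering is traded away.

```python
from concurrent.futures import ThreadPoolExecutor

def handle(record):
    ...  # business logic; some records take much longer than others

def consume_partition(records, max_workers=8):
    """Process one partition's records in parallel so a slow event doesn't
    block the ones queued behind it (head-of-line blocking). Note that this
    trades away strict in-partition ordering."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(handle, record) for record in records]
        for future in futures:
            future.result()  # surface any processing errors
```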

A database schema change was unintentionally reverted, and a subsequent thundering herd exacerbated the impact.

  Ray Chen — Railway
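
As an aside (a generic mitigation sketch, not Railway’s fix): exponential backoff with full jitter is the usual way to keep retrying clients from stampeding a recovering service all at once.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry fn() with exponential backoff plus full jitter, so clients
    recovering at the same moment don't all retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```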

Recently, we had to upgrade a heavily loaded PostgreSQL cluster from version 13 to 16 while keeping downtime minimal. The cluster, consisting of a master and a replica, was handling over 20,000 transactions per second.

  Timur Nizamutdinov — Palark

SRE Weekly Issue #496

A message from our sponsor, CodeRabbit:

CodeRabbit is your AI co-pilot for code reviews. Get instant code review feedback, one-click fix suggestions and define custom rules with AST Grep to catch subtle issues static tools miss. Trusted across 1M repos and 70K open-source projects.

Get Started Today

Progressive rollouts may seem like a great strategy to reduce risk, but this article explains some hidden difficulties. For example, a slow rollout can obscure a problem or make it more difficult to detect.

  Lorin Hochstein

A fun HTTP/2 debugging journey, complete with a somewhat ridiculous solution: don’t forget to read the zero-length response body.

  Lucas Pardue and Zak Cutner — Cloudflare

I know that title sounds like a listicle, but I can tell that this list of canary metrics came from hard-won experience.

   Sascha Neumeier — DZone

This post focuses on the human systems that turn observability into reliability. You’ll see how to define SLOs that drive decisions, build runbooks that scale team knowledge, structure post-mortems that generate improvements and embed these practices into engineering culture without adding bureaucracy.

  Fatih Koç

You don’t have to be a mathematician, but understanding a few key concepts is critical for an SRE.

  Srivatsa RV — One2N

Outputs are non-deterministic, data pipelines shift underfoot, and key components behave like black boxes. As a result, many of the tools and rituals SREs have mastered for decades no longer map cleanly to production AI.

This is a summary of a panel discussion from SREcon EMEA 2025 on how SREs can adapt to LLMs.

  Sylvain Kalache — The New Stack

This nifty tool lets you inject all sorts of faults into a TCP stream and see what happens. It’s in userland, so it’s much easier to use than Linux’s traffic shaper.

  Viacheslav Biriukov

This one starts with an on-call horror story, but fortunately it also has useful tips for improving on-call health.

  Stuart Rimell — Uptime Labs

A production of Tinker Tinker Tinker, LLC