SRE Weekly Issue #499

The folks at Uptime Labs and Adaptive Capacity Labs have announced an advent calendar for this December.

Note: In order to take part, you’ll need to provide an email address to subscribe. I gave that some serious thought before including this here, but ultimately, I have a lot of trust for the folks at both ACL and Uptime Labs, since they’ve both produced so much awesome content that’s been featured here. I’m interested to see what this collab will bring!

  Uptime Labs and Adaptive Capacity Labs

Cool trick: divide short-term P95 latency by the long-term P95 to detect load spikes and adjust rate limits on-the-fly.

  Shravan Gaonkar — Airbnb
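
To make the latency-ratio trick above concrete, here is a minimal sketch. It is not Airbnb's implementation: the class name, window sizes, threshold, and scaling rule are all assumptions chosen for illustration. It assumes latencies are positive, so the long-term P95 is never zero.

```python
from collections import deque


def p95(samples):
    """Return the 95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples recorded yet")
    # Nearest rank: ceil(0.95 * n), computed with integer math.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


class LatencyRatioLimiter:
    """Hypothetical limiter that lowers the allowed request rate when
    short-term P95 latency rises well above long-term P95 latency."""

    def __init__(self, base_limit, short_window=10, long_window=1000, threshold=2.0):
        self.base_limit = base_limit
        self.threshold = threshold  # assumed ratio above which we throttle
        self.short = deque(maxlen=short_window)  # most recent samples only
        self.long = deque(maxlen=long_window)    # longer history as the baseline

    def record(self, latency_ms):
        self.short.append(latency_ms)
        self.long.append(latency_ms)

    def ratio(self):
        return p95(self.short) / p95(self.long)

    def current_limit(self):
        r = self.ratio()
        if r <= self.threshold:
            return self.base_limit
        # Shrink the limit in proportion to how far the ratio exceeds the threshold.
        return max(1, int(self.base_limit * self.threshold / r))


limiter = LatencyRatioLimiter(base_limit=100)
for _ in range(990):
    limiter.record(10)   # steady state: 10 ms
for _ in range(10):
    limiter.record(100)  # sudden spike: 100 ms
```

After the spike, the short window holds only slow samples while the long window is still dominated by the steady state, so the ratio jumps and `current_limit()` drops below the base limit.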

Datadog shares the bigger-picture lessons they learned and improvements they made since their major 2023 outage, including an emphasis on graceful degradation.

  Laura de Vesine, Rob Thomas, and Maciej Kowalewski

This article does a really good job of laying out the problems with serverless that led them to leave: having to layer on significant complexity to deal with the limits of running in Cloudflare workers.

  Andreas Thomas — Unkey

This article explains the two concepts of reliability and fault tolerance and how they relate.

  Oakley Hall

This one could easily be titled, “Today, major system failures meant that I was able to take down production really easily.” There’s some great discussion in the comments, and I hope the author feels better.

  u/Deep-Jellyfish-2383 and others — reddit

Slack shows how they changed their monolithic Chef cookbook change deployment process to reduce risk, by breaking production up into 6 separate environments.

  Archie Gunasekara — Slack

The author discusses reasons why engineer attrition won’t appear in a public incident write-up, and may well not appear in a private one, either.

  Lorin Hochstein

SRE Weekly Issue #498

A message from our sponsor, Costory:

You didn’t sign up to do FinOps. Costory automatically explains why your cloud costs change, and reports it straight to Slack. Built for SREs who want to code, not wrestle with spreadsheets. Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

Cloudflare had a major incident this week, and they say it was their worst since 2019. In this report, they explain what happened, and the failure mode is pretty interesting.

  Matthew Prince — Cloudflare

How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.

They cover not just the motivation for and improvements in V2, but also the migration process to deploy V2 without interruption.

  Shravan Gaonkar — Airbnb

Netflix’s WAL service acts as a go-between, streaming data to pluggable targets while providing extra functionality like retries, delayed sending, and a dead-letter queue.

  Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu — Netflix
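
The retry and dead-letter behavior described above can be sketched in a few lines. This is only an illustration of the general pattern, not Netflix's WAL service: `deliver_with_retries`, the `send` callable, and the list used as a dead-letter queue are hypothetical names, and a real service would add backoff and durable storage.

```python
def deliver_with_retries(message, send, dead_letters, max_attempts=3):
    """Try to deliver a message up to max_attempts times.

    `send` is any callable that raises an exception on failure.
    Messages that exhaust their attempts are appended to `dead_letters`
    so they can be inspected or replayed later.
    Returns True on success and False if the message was dead-lettered.
    """
    for _ in range(max_attempts):
        try:
            send(message)
            return True
        except Exception:
            # A real system would wait (with backoff) before the next attempt.
            continue
    dead_letters.append(message)
    return False
```

Keeping failed messages in a separate queue means one undeliverable message does not block the rest of the stream.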

A (very) deep dive into Datadog’s custom data store, with special attention to how it handles query planning and optimization.

  Sami Tabet — Datadog

Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use it to learn as much as possible about the work involved in diagnosing and remediating incidents in your company.

  Lorin Hochstein

we landed on a two-level failure capture design that combines Kafka topics with an S3 backup to ensure no event is ever lost.

  Tanya Fesenko, Collin Crowell, Dmitry Mamyrin, and Chinmay Sawaji — Klaviyo

Buried in this one is this gem: the last layer of reliability is that their client library automatically retries to alternate regions if the main region fails.

  Paddy Byers — Ably
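
As a rough illustration of the client-side fallback mentioned above, the sketch below tries each region in turn. It is not Ably's client library: the function name, the `send` callable, and the region names are assumptions made for this example.

```python
def call_with_region_fallback(request, regions, send):
    """Send a request to the first region that responds.

    `regions` is an ordered list with the main region first.
    `send(region, request)` is a hypothetical transport function that
    raises ConnectionError when the region is unreachable.
    """
    last_error = None
    for region in regions:
        try:
            return send(region, request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next region
    raise RuntimeError("all regions failed") from last_error


def fake_send(region, request):
    # Simulated transport: the main region is down, the alternate works.
    if region == "main":
        raise ConnectionError("main region unavailable")
    return f"{region} handled {request}"


result = call_with_region_fallback("ping", ["main", "alternate"], fake_send)
```

Because the fallback lives in the client, it still works when the main region cannot answer at all.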

incident.io shares details on how they fared during the AWS us-east-1 incident on October 20.

  Pete Hamilton — incident.io

SRE Weekly Issue #497

A message from our sponsor, Costory:

You didn’t sign up to do FinOps.
Costory automatically explains why your cloud costs change, and reports it straight to Slack.
Built for SREs who want to code, not wrestle with spreadsheets.
Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

A thoughtful framework for evaluating the risk in using AI coding tools, centering around the probability, detectability, and impact of errors.

  Birgitta Böckeler — martinfowler.com

Cloudflare does some really fascinating things with networking. Here’s a deep dive on how they solved a problem in their implementation of sharing IP addresses across machines.

  Chris Branch — Cloudflare

I especially like how they nail down what exactly counts as “zero downtime” in the migration. They did allow some kinds of degradation.

  Anna Dowling — Tines

We’re always making tradeoffs in our systems (and companies). Incidents can help us see whether we’re making the right ones and how our decisions have played out.

  Fred Hebert

Fixation on a plan, on a model of the system, or on a theory of the cause, is a major risk in incident response.

  Lorin Hochstein

how do you design a system with events that have different SLO requirements?

They added a proxy layer on the consumer side to allow parallel processing within partitions, to avoid head-of-line blocking.

  Rohit Pathak, Tanya Fesenko, Collin Crowell, and Dmitry Mamyrin — Klaviyo

A database schema change was unintentionally reverted, and a subsequent thundering herd exacerbated the impact.

  Ray Chen — Railway

Recently, we had to upgrade a heavily loaded PostgreSQL cluster from version 13 to 16 while keeping downtime minimal. The cluster, consisting of a master and a replica, was handling over 20,000 transactions per second.

  Timur Nizamutdinov — Palark

SRE Weekly Issue #496

A message from our sponsor, CodeRabbit:

CodeRabbit is your AI co-pilot for code reviews. Get instant code review feedback, one-click fix suggestions and define custom rules with AST Grep to catch subtle issues static tools miss. Trusted across 1M repos and 70K open-source projects.

Get Started Today

Progressive rollouts may seem like a great strategy to reduce risk, but this article explains some hidden difficulties. For example, a slow rollout can obscure a problem or make it more difficult to detect.

  Lorin Hochstein

A fun HTTP/2 debugging journey, complete with a somewhat ridiculous solution: don’t forget to read the zero-length response body.

  Lucas Pardue and Zak Cutner — Cloudflare

I know that title sounds like a listicle, but I can tell that this list of canary metrics came from hard-won experience.

  Sascha Neumeier — DZone

This post focuses on the human systems that turn observability into reliability. You’ll see how to define SLOs that drive decisions, build runbooks that scale team knowledge, structure post-mortems that generate improvements and embed these practices into engineering culture without adding bureaucracy.

  Fatih Koç

You don’t have to be a mathematician, but understanding a few key concepts is critical for an SRE.

  Srivatsa RV — One2N

Outputs are non-deterministic, data pipelines shift underfoot, and key components behave like black boxes. As a result, many of the tools and rituals SREs have mastered for decades no longer map cleanly to production AI.

This is a summary of a panel discussion from SREcon EMEA 2025 on how SREs can adapt to LLMs.

  Sylvain Kalache — The New Stack

This nifty tool lets you inject all sorts of faults into a TCP stream and see what happens. It’s in userland, so it’s much easier to use than Linux’s traffic shaper.

  Viacheslav Biriukov

This one starts with an on-call horror story, but fortunately it also has useful tips for improving on-call health.

  Stuart Rimell — Uptime Labs

SRE Weekly Issue #495

I’m back! Kidney donation was a fascinating and rewarding experience, and I encourage you to learn more. It’s amazing how it’s possible to fix one human with spare parts from another!

I’ll share more about my experience later, but for now: thank you to the many of you that reached out with well-wishes. I’m feeling great and recovering nicely. I used the National Kidney Registry’s Voucher Program, allowing me to donate my kidney now and complete my healing while the NKR works to find a blood-type matched kidney for my intended recipient. It’s an incredible system.

I’m slowly catching up on the many SRE-related articles posted during my hiatus. If you’ve sent me links, thank you so much, and please understand that I’m woefully behind on my inbox, but I’ll review your suggestion soon!

 

Human error? Perhaps, but there were multiple compounding factors in this airplane incident, including sleep debt, circadian rhythms, an inoperative thrust reverser, and normalization of deviance.

  David Kaminski-Morrow — FlightGlobal

This is a technical report on three bugs that intermittently degraded responses from Claude. Below we explain what happened, why it took time to fix, and what we’re changing.

I especially like the section, “Why detection was difficult”.

  Anthropic

While I was out, I definitely heard about the big AWS us-east-1 outage! Here’s Amazon’s write-up of the incident, involving a latent race condition.

  Amazon

I really love this analysis of the AWS us-east-1 outage. It’s Lorin’s Law once again: an infrastructure feature designed to improve reliability is implicated in an incident.

  Lorin Hochstein

Ouch! We should exercise caution when ascribing actions like “lying” and “covering tracks” to LLM-based agents — and of course when giving such agents deep access to modify our systems.

  Bruce Gil — Gizmodo

This post delves into our journey to transform incident management from a centralized function into a widespread, accessible practice and the hard-won lessons we’ve learned along the way.

They built a paved path based on incident.io that any of their teams could use to manage an incident.

  Molly Struve — Netflix

If someone did something wrong, then it’s vital to understand why they did it.

My favorite part of this article is the common list of reasons people violate procedures.

  NorthStandard

A detailed description of Cloudflare’s new R2 SQL service that provides serverless querying across data in their object store service. This article helped me understand things I hadn’t really grasped before about how columnar datastores work.

  Yevgen Safronov, Nikita Lapkov, Jérôme Schneider — Cloudflare

A production of Tinker Tinker Tinker, LLC