SRE Weekly Issue #497

A message from our sponsor, Costory:

You didn’t sign up to do FinOps.
Costory automatically explains why your cloud costs change, and reports it straight to Slack.
Built for SREs who want to code, not wrestle with spreadsheets.
Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

A thoughtful framework for evaluating the risk of using AI coding tools, centered on the probability, detectability, and impact of errors.

  Birgitta Böckeler — martinfowler.com

Cloudflare does some really fascinating things with networking. Here’s a deep dive on how they solved a problem in their implementation of sharing IP addresses across machines.

  Chris Branch — Cloudflare

I especially like how they nail down what exactly counts as “zero downtime” in the migration. They did allow some kinds of degradation.

  Anna Dowling — Tines

We’re always making tradeoffs in our systems (and companies). Incidents can help us see whether we’re making the right ones and how our decisions have played out.

  Fred Hebert

Fixation on a plan, on a model of the system, or on a theory of the cause, is a major risk in incident response.

  Lorin Hochstein

How do you design a system with events that have different SLO requirements?

They added a proxy layer on the consumer side to allow parallel processing within partitions, to avoid head-of-line blocking.

  Rohit Pathak, Tanya Fesenko, Collin Crowell, and Dmitry Mamyrin — Klaviyo
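
For a rough sense of what that kind of key-level parallelism looks like, here’s a toy sketch of my own (not Klaviyo’s code, and it glosses over offset commits, which is presumably where their proxy layer earns its keep):

```python
# Toy sketch of key-level parallelism within a single partition (not Klaviyo's code):
# records fan out to per-key worker queues, so a slow key no longer blocks unrelated
# keys, while per-key ordering is preserved.
import queue
import threading

NUM_WORKERS = 4  # hypothetical worker count

def worker(q: queue.Queue) -> None:
    while True:
        record = q.get()
        if record is None:  # shutdown sentinel
            break
        key, value = record
        # Real processing would go here; a slow record only delays this worker's keys.
        print(f"processed key={key} value={value}")

queues = [queue.Queue() for _ in range(NUM_WORKERS)]
threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
for t in threads:
    t.start()

# Records as they arrived on the partition, in order: (key, value)
partition_records = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 2)]
for key, value in partition_records:
    # The same key always hashes to the same worker, so per-key order is kept.
    queues[hash(key) % NUM_WORKERS].put((key, value))

for q in queues:
    q.put(None)  # stop workers
for t in threads:
    t.join()
```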

A database schema change was unintentionally reverted, and a subsequent thundering herd exacerbated the impact.

  Ray Chen — Railway

Recently, we had to upgrade a heavily loaded PostgreSQL cluster from version 13 to 16 while keeping downtime minimal. The cluster, consisting of a master and a replica, was handling over 20,000 transactions per second.

  Timur Nizamutdinov — Palark

SRE Weekly Issue #496

A message from our sponsor, CodeRabbit:

CodeRabbit is your AI co-pilot for code reviews. Get instant code review feedback, one-click fix suggestions and define custom rules with AST Grep to catch subtle issues static tools miss. Trusted across 1M repos and 70K open-source projects.

Get Started Today

Progressive rollouts may seem like a great strategy to reduce risk, but this article explains some hidden difficulties. For example, a slow rollout can obscure a problem or make it more difficult to detect.

  Lorin Hochstein
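
To make the dilution point concrete, here are some made-up numbers showing how a small rollout slice can hide a badly broken version if you only watch the aggregate metric:

```python
# Made-up numbers: a 1% rollout slice dilutes a badly elevated error rate
# if you only watch the aggregate metric.
baseline_error_rate = 0.001  # 0.1% errors on the old version
broken_error_rate = 0.02     # 2% errors on the new version
rollout_fraction = 0.01      # 1% of traffic on the new version

aggregate = ((1 - rollout_fraction) * baseline_error_rate
             + rollout_fraction * broken_error_rate)
print(f"aggregate error rate: {aggregate:.3%}")  # ~0.119%, easy to miss without per-version metrics
```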

A fun HTTP/2 debugging journey, complete with a somewhat ridiculous solution: don’t forget to read the zero-length response body.

  Lucas Pardue and Zak Cutner — Cloudflare
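
Their bug is deep in HTTP/2 internals, but the underlying hygiene point generalizes. Here’s my own illustration in plain Python with http.client (nothing to do with Cloudflare’s stack): on a keep-alive connection, every response body has to be read, even a zero-length one, before the connection can be reused.

```python
# My own illustration with Python's http.client (not the stack in the article):
# on a keep-alive connection, every response, even one with an empty body,
# must be read before the connection can serve the next request.
import http.client

conn = http.client.HTTPSConnection("example.com")

conn.request("GET", "/")
first = conn.getresponse()
first.read()  # drain the body, even if it is zero bytes, so the connection is reusable

conn.request("GET", "/")     # without the read() above, http.client refuses to
second = conn.getresponse()  # hand out this second response (ResponseNotReady)
print(second.status, len(second.read()))
conn.close()
```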

I know that title sounds like a listicle, but I can tell that this list of canary metrics came from hard-won experience.

   Sascha Neumeier — DZone

This post focuses on the human systems that turn observability into reliability. You’ll see how to define SLOs that drive decisions, build runbooks that scale team knowledge, structure post-mortems that generate improvements, and embed these practices into engineering culture without adding bureaucracy.

  Fatih Koç

You don’t have to be a mathematician, but understanding a few key concepts is critical for an SRE.

  Srivatsa RV — One2N

Outputs are non-deterministic, data pipelines shift underfoot, and key components behave like black boxes. As a result, many of the tools and rituals SREs have mastered for decades no longer map cleanly to production AI.

This is a summary of a panel discussion from SREcon EMEA 2025 on how SREs can adapt to LLMs.

  Sylvain Kalache — The New Stack

This nifty tool lets you inject all sorts of faults into a TCP stream and see what happens. It’s in userland, so it’s much easier to use than Linux’s traffic shaper.

  Viacheslav Biriukov

This one starts with an on-call horror story, but fortunately it also has useful tips for improving on-call health.

  Stuart Rimell — Uptime Labs

SRE Weekly Issue #495

I’m back! Kidney donation was a fascinating and rewarding experience, and I encourage you to learn more. It’s amazing how it’s possible to fix one human with spare parts from another!

I’ll share more about my experience later, but for now: thank you to the many of you that reached out with well-wishes. I’m feeling great and recovering nicely. I used the National Kidney Registry’s Voucher Program, allowing me to donate my kidney now and complete my healing while the NKR works to find a blood-type matched kidney for my intended recipient. It’s an incredible system.

I’m slowly catching up on the many SRE-related articles posted during my hiatus. If you’ve sent me links, thank you so much, and please understand that I’m woefully behind on my inbox, but I’ll review your suggestion soon!

 

Human error? Perhaps, but there were multiple compounding factors in this airplane incident, including sleep debt, circadian rhythms, an inoperative thrust reverser, and normalization of deviance.

  David Kaminski-Morrow — FlightGlobal

This is Anthropic’s technical report on three bugs that intermittently degraded responses from Claude; it explains what happened, why it took time to fix, and what they’re changing.

I especially like the section, “Why detection was difficult”.

  Anthropic

While I was out, I definitely heard about the big AWS us-east-1 outage! Here’s Amazon’s write-up of the incident, involving a latent race condition.

  Amazon

I really love this analysis of the AWS us-east-1 outage. It’s Lorin’s Law once again: an infrastructure feature designed to improve reliability is implicated in an incident.

  Lorin Hochstein

Ouch! We should exercise caution when ascribing actions like “lying” and “covering tracks” to LLM-based agents — and of course when giving such agents deep access to modify our systems.

  Bruce Gil — Gizmodo

This post delves into our journey to transform incident management from a centralized function into a widespread, accessible practice and the hard-won lessons we’ve learned along the way.

They built a paved path based on Incident.io that any of their teams could use to manage an incident.

  Molly Struve — Netflix

If someone did something wrong, then it’s vital to understand why they did it.

My favorite part of this article is the list of common reasons people violate procedures.

  NorthStandard

A detailed description of Cloudflare’s new R2 SQL service that provides serverless querying across data in their object store service. This article helped me understand things I hadn’t really grasped before about how columnar datastores work.

  Yevgen Safronov, Nikita Lapkov, and Jérôme Schneider — Cloudflare
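
In case you also need the refresher on the columnar part, here’s a toy sketch of the core idea (nothing R2 SQL-specific): store each column contiguously so a filter on one column never touches the others, and keep per-chunk min/max stats so whole chunks can be skipped.

```python
# Toy sketch of the columnar idea (nothing R2 SQL-specific): each column is stored
# contiguously, so a filter on one column never touches the others, and per-chunk
# min/max stats let whole chunks be skipped.
rows = [
    {"ts": 1, "status": 200, "bytes": 512},
    {"ts": 2, "status": 500, "bytes": 128},
    {"ts": 3, "status": 200, "bytes": 2048},
]

# Column layout: one contiguous list per column instead of one dict per row.
columns = {name: [row[name] for row in rows] for name in ("ts", "status", "bytes")}

# Per-column statistics for this chunk enable pruning.
stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}

def bytes_where_status(wanted: int) -> list[int]:
    lo, hi = stats["status"]
    if not (lo <= wanted <= hi):
        return []  # the whole chunk is pruned without reading any column data
    return [b for s, b in zip(columns["status"], columns["bytes"]) if s == wanted]

print(bytes_where_status(500))  # [128]
```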

SRE Weekly Issue #494

SRE Weekly will be on hiatus for the next 6 weeks while I’m on medical leave.

If all goes to plan, I’ll be donating a kidney for a loved one later this week, reducing my internal redundancy to help them respond to their own internal renal incident. If you’re interested, I invite you to learn more about kidney donation. It’s fascinating!



Courtney Nash over at The VOID has launched an in-depth survey of incident management practices in tech. Please consider taking the time to fill out this survey. We all stand to benefit hugely from the information it will gather.

  Courtney Nash

Speaking of The VOID, the first bit of the September issue of the VOID Newsletter stood out to me:

Back in June, Salesforce had what appeared to be a pretty painful Heroku outage. About a month later, tech blogger Gergely Orosz posted about the incident on BlueSky. I’m bringing this up now because I’ve had over a month to chew on his commentary and I’m still mad about it. As someone who deals in reading public incident reports as a primary feature of my work, I find nothing more infuriating than people arm chair quarterbacking other organizations’ incidents and presuming they actually have any idea _what really happened_.

As it happens, I also commented on the similarity of Salesforce’s incident to a Datadog incident from the past in issue 482.

I’m with Courtney Nash: we really have to be careful how we opine on public incident write-ups. Not only is it important to avoid blame and hindsight bias, but we also need to be careful not to disincentivize companies from posting public incident write-ups. I highly recommend clicking through to read Courtney’s full analysis.

  Courtney Nash

This guide explains what error budgets are, how to manage them effectively, what to look out for, and how they differ from SLOs.

Includes sections on potential pitfalls, real-world examples, and impact on company culture.

  Nawaz Dhandala — OneUptime
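
The arithmetic behind an error budget is simple enough to fit in a few lines; here’s a quick worked example:

```python
# Worked example: turning an SLO target into an error budget.
from datetime import timedelta

slo_target = 0.999            # 99.9% availability over the window
window = timedelta(days=30)

error_budget = 1 - slo_target  # the fraction of the window you're allowed to burn
allowed_downtime = window * error_budget
print(f"error budget: {error_budget:.1%}")
print(f"allowed downtime over 30 days: {allowed_downtime}")  # about 43 minutes
```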

This article explores how backend engineers and DevOps teams can detect, debug, and prevent message loss in Kafka-based streaming pipelines using tools like OpenTelemetry, Fluent Bit, Jaeger, and dead-letter queues.

   Prakash Wagle — DZone
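
As one concrete piece of that, here’s a minimal dead-letter queue sketch (assuming the kafka-python package and made-up topic names; the article’s own stack may well differ) in which messages that fail processing are parked on a DLQ topic with context instead of being dropped:

```python
# Minimal dead-letter-queue sketch (assumes the kafka-python package and made-up
# topic names; the article's own stack may differ).
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders",  # hypothetical source topic
    bootstrap_servers="localhost:9092",
    group_id="order-processor",
    enable_auto_commit=False,
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    try:
        order = json.loads(message.value)
        # Real processing of `order` would go here.
    except Exception as exc:
        # Instead of silently dropping the message, park it on a DLQ topic with context.
        producer.send(
            "orders-dlq",
            value=message.value,
            headers=[
                ("error", str(exc).encode()),
                ("source_offset", str(message.offset).encode()),
            ],
        )
    consumer.commit()  # commit only after the message is processed or parked
```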

Faced with 80 million time series, these folks found that StatsD + InfluxDB weren’t cutting it, so they switched to Prometheus.

Accessibility note: this article contains a table of text in an image with no alt text.

  Kapil

How do these folks keep producing such detailed write-ups the day after an incident?

  Tom Lianza and Joaquin Madruga — Cloudflare

The author ties a recent outage in San Francisco’s BART transit service to a couple of previous incidents by a common thread: confidence placed in a procedure that had been performed successfully in the past.

This article also links to BART’s memo, which is surprisingly detailed and a great read.

  Lorin Hochstein

The folks at Graphite take us through their discovery of why code search is difficult and the strategies they employed to solve it.

  Brandon Willett — Graphite

SRE Weekly Issue #493

A message from our sponsor, Shipfox:

Shipfox supercharges GitHub Actions – no workflow changes, 30-min setup.

  • 2x faster builds with better CPU, faster disks & high-throughput caching
  • 75% lower costs with shorter jobs and better price-per-performance
  • Full CI observability with test/job speed and reliability

👉 See how it works: https://www.shipfox.io?utm_source=SREWeekly&utm_campaign=issue493

I like how this goes deep on the ways proxies can manage many connections at once, like SO_REUSEPORT.

  Mitendra Mahto
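
If you haven’t run into SO_REUSEPORT before, here’s a minimal Python sketch of the idea (Linux-only, arbitrary port): run it a few times and the kernel spreads new connections across all of the listeners.

```python
# Minimal SO_REUSEPORT sketch (Linux): run this script several times and the kernel
# will spread incoming connections across all of the listeners on the same port.
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)  # each process gets its own accept queue
sock.bind(("0.0.0.0", 8080))  # arbitrary port for the sketch
sock.listen(128)
print(f"pid {os.getpid()} listening on :8080")

while True:
    conn, addr = sock.accept()
    conn.sendall(f"served by pid {os.getpid()}\n".encode())
    conn.close()
```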

Here I want to talk about two classes of problems where accountability is a poor solution for addressing the problem; the OceanGate accident falls into the second class.

  Lorin Hochstein

This one has so many lessons we can learn from that it might as well be about IT infrastructure.

Research from high-reliability organizations reveals that individual errors are almost always symptoms of deeper systemic problems.

   Muhammad Abdullah Khan — KevinMD.com

But Vibe Coding introduces real risks, particularly around resilience, that are worth examining before we place too much faith in it.

I really appreciate the way the author methodically lays out their points, including through the concept of competitive and complementary artifacts.

  Stuart Rimell — Uptime Labs

I really enjoyed the part about the Google interview question. That one’s going to have me thinking for a while.

  Jos Visser

Here’s a great overview of why time and ordering are so important (and difficult) in distributed systems.

  Sid — The Scalable Thread
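
One classic tool in this space (I’m not sure whether the article covers it) is the Lamport logical clock, which orders events without relying on synchronized wall clocks. A minimal sketch:

```python
# Minimal Lamport logical clock: ordering events without synchronized wall clocks.
class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event or message send: advance the clock.
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # On receipt, jump past the sender's timestamp if it is ahead of us.
        self.time = max(self.time, sender_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
sent_at = a.tick()            # a sends a message stamped 1
b.tick()                      # an unrelated local event on b, stamped 1
received_at = b.receive(sent_at)
print(sent_at, received_at)   # 1 2: the receive is ordered after the send
```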

I really like the way the author teases apart the true, practical meaning of “eventual consistency”. The example of Amazon shopping carts is especially illuminating.

  Uwe Friedrichsen

A production of Tinker Tinker Tinker, LLC