General

SRE Weekly Issue #502

Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.

  Harris Hancock — Cloudflare
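The article has far more detail than I can do justice to here, but if consistent hashing is new to you, here’s a bare-bones Python sketch of the placement idea (server and script names are mine, not Cloudflare’s): route each Worker script to a stable point on a hash ring so repeat requests land on a machine that likely already has a warm isolate.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash so placement survives process restarts.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps script names to servers; adding or removing a server only
    remaps the keys that fall in that server's arc of the ring."""

    def __init__(self, servers, vnodes: int = 100):
        self._ring = []  # sorted list of (point, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((_hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"metal-{n}" for n in range(8)])
print(ring.lookup("customer-worker-script"))  # same script -> same server,
                                              # so its isolate stays warm
```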

I appreciate the way this article also shares how logs, metrics, traces, and alerts each have their downsides, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.

  Mahmoud Abdelwahab — Railway

I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.

  John Allspaw — Adaptive Capacity Labs

IaC may create more problems than it solves, and it may simply move or hide complexity, according to this article.

  RoseSecurity

[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.

  Fred Hebert — summary

  Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper

This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.

  Marc Brooker

A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.

  Ar Hakboian — OpsWorker.ai

I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.

  Lorin Hochstein

SRE Weekly Issue #501

A message from our sponsor, Depot:

“Waiting for a runner” but the runner is online? Depot debugs three cases where symptoms misled engineers. Workflow permissions, Azure authentication, and Dependabot’s security context all caused failures that looked like infrastructure problems.

A thoughtful evaluation of current trends in AI through the lens of Lisanne Bainbridge’s classic paper, The Ironies of Automation. I really got a lot out of this one.

  Uwe Friedrichsen

They supercharged the workflow engine by rewriting it. I like the way they explained why they settled on a full rewrite and the alternative options they considered.

  Jun He, Yingyi Zhang, and Ely Spears — Netflix

This one goes deep on how to build a reliable service on unreliable parts. Can retries improve your overall reliability? What about the reliability of the retry system itself?

  Warren Parad — Authress
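The article goes much deeper than this, but the core arithmetic is worth internalizing: if one attempt succeeds with probability p, then n independent attempts succeed with probability 1 − (1 − p)^n. Here’s a generic sketch (not Authress’s code; delays and attempt counts are arbitrary) of a retry with backoff and jitter, so the retry layer doesn’t become its own reliability problem:

```python
import random
import time

def retry(call, attempts: int = 3, base_delay: float = 0.1):
    """Retry with capped exponential backoff and full jitter, so the
    retry layer itself doesn't hammer an already-struggling dependency."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retry budget: surface the failure
            # Full jitter: sleep a random amount up to the backoff cap.
            time.sleep(random.uniform(0, min(base_delay * 2 ** attempt, 2.0)))

# If a single attempt succeeds with probability p, n attempts succeed with
# probability 1 - (1 - p) ** n -- e.g. p = 0.99, n = 3 gives ~0.999999.
p, n = 0.99, 3
print(1 - (1 - p) ** n)
```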

In this article, we’ll explore how cold-restart dependencies form, why typical recovery designs break down, and what architectural principles can help systems warm up faster after a complete outage.

  Bala Kambala

This one goes into the qualities of a good post-incident review, the definition of resilience, and a discussion of blamelessness, drawing lessons from aviation.

  Gamunu Balagalla — Uptime Labs

It would be easy to blame the poor outcome of BOAC 712’s engine failure on human error since the pilots missed key steps in their checklists. Instead, the investigation cited systemic issues, resulting in improvements in checklists and other areas.

  Mentour Pilot

Cloudflare had another significant outage, though not as big as the one last month. This one was related to steps they took to mitigate the big React RCE vulnerability.

  Dane Knecht — Cloudflare

Lorin’s whole analysis is awesome, but there’s an especially incisive section at the end that uses math to put Cloudflare’s run of 2 recent big incidents in perspective.

  Lorin Hochstein

SRE Weekly Issue #500

A message from our sponsor, Depot:

Stop hunting through GitHub Actions logs. Depot now offers powerful CI log search across all your repositories and workflows. With smart filtering by timeframe, runner type, and keywords, you’ll have all the information at your fingertips to debug faster.

Wow, five hundred issues! I sent the first issue of SRE Weekly out almost exactly ten years ago. I assumed my little experiment would fairly quickly come to an end when I exhausted the supply of SRE-related articles.

I needn’t have worried. Somehow, the authors I’ve featured here have continued to produce a seemingly endless stream of excellent articles. If anything, the pace has only picked up over time! A profound thank you to all of the authors, without whom this newsletter would be just an empty bulleted list.

And thanks to you, dear readers, for making this worthwhile. Thanks for sharing the articles you find or write, I love receiving them! Thanks for the notes you send after an issue you particularly like, and the corrections too. Thanks for your kind well-wishes for my recent surgery, they meant a ton.

Finally, thanks to my sponsors, whose support makes all this possible. If you see something interesting, please give it a click and check it out!

When a scale-up event actually causes increased resource usage for a while, a standard auto-scaling algorithm can fail.

  Minh Nhat Nguyen, Shi Kai Ng, and Calvin Tran — Grab
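Their fix is worth reading in full; as a generic illustration of the failure mode (not necessarily Grab’s approach, and with arbitrary thresholds and timings), here’s a toy autoscaler with a cooldown after scaling up, since freshly added capacity can temporarily push aggregate resource usage higher rather than lower:

```python
import time

class Autoscaler:
    """Threshold autoscaler with a cooldown: after scaling up, ignore
    further scaling signals for a while, because freshly added replicas
    (cache warming, JIT, rebalancing) temporarily *raise* utilization."""

    def __init__(self, target_util: float = 0.7, cooldown_s: float = 300.0):
        self.target_util = target_util
        self.cooldown_s = cooldown_s
        self.replicas = 4
        self._last_scale_up = float("-inf")

    def observe(self, utilization: float, now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        in_cooldown = (now - self._last_scale_up) < self.cooldown_s
        if utilization > self.target_util and not in_cooldown:
            self.replicas += 1
            self._last_scale_up = now
        elif utilization < self.target_util * 0.5 and not in_cooldown:
            # Also hold off scale-down during warm-up to avoid flapping.
            self.replicas = max(1, self.replicas - 1)
        return self.replicas
```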

A database schema change added an index on a large table without using the CONCURRENTLY option, locking the table. This reminds me of a similar incident from when I worked for Honeycomb, and the solution they came up with.

  Ray Chen — Railway
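If you’re on Postgres, the non-locking variant comes with a gotcha of its own: CREATE INDEX CONCURRENTLY can’t run inside a transaction block, so with psycopg2 you need autocommit. A quick sketch (the DSN, table, and index names are placeholders, and this isn’t Railway’s actual migration):

```python
import psycopg2

# Placeholders: swap in your own DSN, table, and column.
conn = psycopg2.connect("dbname=app")
conn.autocommit = True  # CONCURRENTLY refuses to run inside a transaction block

with conn.cursor() as cur:
    # Blocking form -- takes a lock that stalls writes on a large, busy table:
    #   CREATE INDEX idx_events_account ON events (account_id);
    # Non-blocking form -- slower to build, but writes keep flowing:
    cur.execute(
        "CREATE INDEX CONCURRENTLY idx_events_account ON events (account_id)"
    )
```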

Oof, that’s a harsh title, but this is a great discussion of how we strive to design for reliability even when our downstream vendors have outages.

  Uwe Friedrichsen

This one has a lot of good recommendations for staff-level SREs covering 8 areas, shared by a former Staff SRE.

  Karan Nagarajagowda

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses were blocked by synchronous log writes to a busy disk.

  Nataraj Mocherla — DZone

This air accident report video by Mentour Pilot has a great example of alert fatigue around 30 minutes in. The air traffic controllers received enough spurious conflict alerts every day that they became easy to ignore.

  Mentour Pilot

In this post you learn:
* What are emergent properties and what kind of system has them?
* What are weak and strong emergence as opposed to resultant properties?
* How do emergent properties impact the reliability, maintainability, predictability, and cost of the system?

Well worth a read. It really got me thinking about emergence and its relationship to reliability.

  Alex Ewerlöf

In an incident, it’s important to have someone be in charge — and for it to be clear who that is, as explained in this article.

  Joe Mckevitt — Uptime Labs

SRE Weekly Issue #499

The folks at Uptime Labs and Adaptive Capacity Labs have announced an advent calendar for this December.

Note: In order to take part, you’ll need to provide an email address to subscribe. I gave that some serious thought before including this here, but ultimately, I have a lot of trust in the folks at both ACL and Uptime Labs, since they’ve both produced so much awesome content that’s been featured here. I’m interested to see what this collab will bring!

  Uptime Labs and Adaptive Capacity Labs

Cool trick: divide short-term P95 latency by the long-term P95 to detect load spikes and adjust rate limits on-the-fly.

  Shravan Gaonkar — Airbnb
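The numbers below are made up rather than Airbnb’s, but here’s roughly what that trick looks like in code: keep a short window and a long window of latencies, compare their P95s, and shrink the limit when the ratio spikes.

```python
from collections import deque
from statistics import quantiles

class AdaptiveLimiter:
    """Compare short-window P95 latency to long-window P95; if the ratio
    spikes, the service is under load, so shrink the rate limit."""

    def __init__(self, base_limit: int = 1000):
        self.short = deque(maxlen=200)   # recent requests
        self.long = deque(maxlen=5000)   # steady-state baseline
        self.base_limit = base_limit

    @staticmethod
    def _p95(samples) -> float:
        return quantiles(samples, n=20)[-1]  # 95th percentile cut point

    def record(self, latency_ms: float) -> int:
        self.short.append(latency_ms)
        self.long.append(latency_ms)
        if len(self.long) < 100:
            return self.base_limit           # not enough data yet
        ratio = self._p95(self.short) / self._p95(self.long)
        if ratio > 1.5:                      # short-term P95 well above baseline
            return int(self.base_limit / ratio)  # tighten proportionally
        return self.base_limit
```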

Datadog shares the bigger-picture lessons they learned and improvements they made since their major 2023 outage, including an emphasis on graceful degradation.

  Laura de Vesine, Rob Thomas, and Maciej Kowalewski

This article does a really good job of laying out the problems with serverless that led them to leave: having to layer on significant complexity to deal with the limits of running in Cloudflare workers.

  Andreas Thomas — Unkey

This article explains the two concepts of reliability and fault tolerance and how they relate.

  Oakley Hall

This one could easily be titled, “Today, major system failures meant that I was able to take down production really easily.” There’s some great discussion in the comments, and I hope the author feels better.

  u/Deep-Jellyfish-2383 and others — reddit

Slack shows how they reworked the way they deploy changes to their monolithic Chef cookbook to reduce risk, breaking production up into 6 separate environments.

  Archie Gunasekara — Slack

The author discusses reasons why engineer attrition won’t appear in a public incident write-up, and may well not appear in a private one, either.

  Lorin Hochstein

SRE Weekly Issue #498

A message from our sponsor, Costory:

You didn’t sign up to do FinOps. Costory automatically explains why your cloud costs change, and reports it straight to Slack. Built for SREs who want to code, not wrestle with spreadsheets. Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

Cloudflare had a major incident this week, and they say it was their worst since 2019. In this report, they explain what happened, and the failure mode is pretty interesting.

  Matthew Prince — Cloudflare

How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.

They cover not just the motivation for and improvements in V2, but also the migration process to deploy V2 without interruption.

  Shravan Gaonkar — Airbnb

Netflix’s WAL service acts as a go-between, streaming data to pluggable targets while providing extra functionality like retries, delayed sending, and a dead-letter queue.

  Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu — Netflix
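This isn’t Netflix’s API, but as a sketch of the shape they describe, imagine a relay that accepts a write, retries delivery to a pluggable target, and parks anything that still fails on a dead-letter queue for later replay:

```python
from dataclasses import dataclass, field
from typing import Protocol

class Target(Protocol):
    def deliver(self, event: bytes) -> None: ...

@dataclass
class WalRelay:
    """Toy relay: try the pluggable target a bounded number of times,
    and never drop an event -- failures land on a dead-letter queue."""
    target: Target
    max_attempts: int = 3
    dead_letters: list[bytes] = field(default_factory=list)

    def append(self, event: bytes) -> None:
        for _attempt in range(self.max_attempts):
            try:
                self.target.deliver(event)
                return
            except Exception:
                continue
        self.dead_letters.append(event)  # replay these once the target recovers
```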

A (very) deep dive into Datadog’s custom data store, with special attention to how it handles query planning and optimization.

  Sami Tabet — Datadog

Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use it to learn as much as possible about the work involved in diagnosing and remediating incidents in your company.

  Lorin Hochstein

we landed on a two-level failure capture design that combines Kafka topics with an S3 backup to ensure no event is ever lost.

  Tanya Fesenko, Collin Crowell, Dmitry Mamyrin, and Chinmay Sawaji — Klaviyo
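The article covers the real implementation; as a tiny sketch of the two-level idea (the callables below stand in for a real Kafka producer and S3 client), the fast path is the Kafka produce and the fallback writes the event somewhere durable for later replay:

```python
from typing import Callable

def capture(event: bytes,
            publish_to_kafka: Callable[[bytes], None],
            backup_to_s3: Callable[[bytes], None]) -> None:
    """Two-level capture: fast path is a Kafka produce; if that fails,
    fall back to durable object storage so the event can be replayed."""
    try:
        publish_to_kafka(event)
    except Exception:
        backup_to_s3(event)

def flaky_kafka_produce(event: bytes) -> None:
    raise RuntimeError("broker unavailable")  # simulate a Kafka outage

capture(b'{"type": "signup"}',
        publish_to_kafka=flaky_kafka_produce,
        backup_to_s3=lambda e: print("backed up", len(e), "bytes to S3"))
```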

Buried in this one is this gem: the last layer of reliability is that their client library automatically retries to alternate regions if the main region fails.

  Paddy Byers — Ably
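The details are Ably’s, but the general pattern is easy to picture. Here’s a hedged sketch (region names and the send callable are placeholders, not their client library) of a client that falls back to alternate regions when the primary fails:

```python
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-2"]  # illustrative names

def publish(message: str, send_to_region) -> str:
    """Try the primary region first, then fail over to the alternates --
    the 'last layer' idea from the blurb, with a callable standing in
    for a real per-region client."""
    last_error = None
    for region in REGIONS:
        try:
            send_to_region(region, message)
            return region                 # report which region accepted it
        except Exception as err:
            last_error = err              # remember why, keep failing over
    raise RuntimeError("all regions failed") from last_error
```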

incident.io shares details on how they fared during the AWS us-east-1 incident on October 20.

  Pete Hamilton — incident.io

A production of Tinker Tinker Tinker, LLC