SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #522

lex

June 21, 2026

Incident status updates are a translation problem, and the right translator probably isn’t in Engineering

[…] the fix isn’t “train your engineers to write better status updates.” The fix is to stop asking your engineers to write them, and start asking the right people instead.

Brent Chapman

Scaling Security Insights: how we achieved a 10x increase in global scanning capacity

A satisfying scaling story where every fix came from looking more closely at the system — Kafka head-of-line blocking, a clumpy scheduler, and an active-active API that silently doubled latency for half of all partitions.

Dave Baxter — Cloudflare

Vibe-Coded Infra Is Your New Reliability Hazard

Some good examples of risks in here, along with an interesting tendency to blame “user error”.

Prakshal Doshi — HackerNoon

An incident response playbook for satellite operations on AWS (Part-1): Detection and forensic readiness

Satellites present unique reliability constraints like limited data uplink windows and the risk of bricking a very expensive piece of equipment.

Author:

Incident Fest ’26

This looks fun! It’s a free virtual event on July 8.

Uptime Labs

The feedback loops behind Kubernetes

This article does a really great job of building up an explanation of feedback-based control and the difference between edge-triggered and level-triggered systems.

Fatih Arslan — PlanetScale

Dear researchers column

An open letter to software researchers to study incident response in software systems. It’s so cool how the author translates incident response concepts to researchers who may not be familiar, with examples.

Lorin Hochstein

Meet Alice. Alice is impatient.

An important concept: a user’s perception of your average outage duration is weighted and won’t match a flat average MTTR.

Marc Brooker

SRE Weekly Issue #521

lex

June 14, 2026

General

Comments

View on sreweekly.com

In incidents, swarming is a feature, not a bug

Spontaneous swarming of responders might seem like a nuisance that breaks our tidy mental models of incident response, but it’s actually very powerful. It’s something to facilitate and encourage, not simply tolerate.

Brent Chapman

Exactly Once Processing: Myth vs Reality

The misconception is that the local assurances automatically combine to form a single end-to-end promise that spans brokers, processors, databases, outboxes, caches, webhooks, and external APIs.

Irullappan irulandi — DZone

How we reduced core unit boot time from hours to minutes

When a firmware issue caused reboots for firmware upgrades to take four hours(!), they had to find a solution.

Giovanni Pereira Zantedeschi, Nnamdi Ajah, and Omar Sheik-Omar — Cloudflare

AI enthusiasts are in a race against time, AI skeptics are in a race against entropy

This one strikes a balance on AI that really speaks to me.

If you’re the one left holding the bag, you should generally get final say over what goes in that bag.

Charity Majors

Sitar-agent: Building a reliable dynamic configuration sidecar at scale

How Airbnb built a Kubernetes sidecar to deliver dynamic configuration reliably at scale.

Bo Teng — Airbnb

When failover isn’t safe: Building high-availability PostgreSQL on Kubernetes

In this post, we’ll walk through how we redesigned our Kubernetes-based PostgreSQL clusters for failover safety, how we balanced durability against latency, and what we learned while validating this approach through benchmarking and failure testing.

Shree Sampath — Datadog

When Claude changed, everything changed: Managing AI blast radius in production

The failure mode on this one is really interesting, and the bit about “infinite blast radius” caught my eye.

Sarat Mahavratayajula ,Vijay Sagar Gullapalli — VentureBeat

Why we need resilient software design – Part 2

I’m enjoying this series so far, and I’m looking forward to reading the rest. It’s worth starting at part 1, but part 2 can stand on its own in a pinch.

Uwe Friedrichsen

SRE Weekly Issue #520

lex

June 7, 2026

General

Comments

View on sreweekly.com

AI Agents Expose a Design Gap in Microservices Resilience

We build our systems against the usage patterns of human users, but agents fundamentally change the game.

Vineet Bhatkoti — DZone

AI agents are quietly generating chaos engineering failures enterprises don’t track yet

This is an interesting lens for exploring the risks that agents can introduce.

Sayali Patil — VentureBeat

Reddit r/sre: How long does your company give new people before they put them oncall

Great discussion in the comments! There’s a lot of variance in how much time people recommend. I personally tend to lean earlier — on-call is a great way to learn, and I can always reach out if I get stuck.

u/modern_medicine_isnt and commenters — Reddit r/sre

Metastable Failures Explained: Why Fixing the Trigger Fails

A great into to the concept of metastable failures — and I recommend reading the original paper as well.

Teiva Harsanyi

Most Companies Wait Too Long to Declare Incidents

The real issue is that your company has made declaring an incident costly and risky for the person who does it.

Brent Chapman

A postmortem of our May 7, 2026 outage

I enjoyed learning about their deliberate architectural choice to keep their central service in a single AZ. This incident highlighted a need for a fast failover plan.

Coinbase

Customers over control: how we measure On-call reliability

I like the balance between ensuring 99.99% reliability and designing their product to encourage customers to use their platform in a way that effectively manages the 0.01% case.

Reliability is a customer experience problem

Mike Fisher — incident.io

The demon of the gaps

I’m not gonna spoil this one for you by writing a summary. Just read it, trust me.

Lorin Hochstein

SRE Weekly Issue #519

lex

May 31, 2026

General

Comments

View on sreweekly.com

The Problem with AI-Generated Post-Incident Reviews

They give solid examples to argue that much of the learning happens during the process of writing a post-incident review.

[…] you could throw the post-incident review document away after writing it and still get the vast majority of the value out of the process.

Brent Chapman

You Shipped It Fast. But Did You Ship It Right?

I really like this idea of change absorption capacity.

Priya Gopalsamy — Stack Overflow

On benchmarking

A useful guide that covers strategies for benchmarking, along with pitfalls to avoid.

Ben Dicken — PlanetScale

Serverless Illusion: When “Pay What You Use” is Expensive

Serverless isn’t inherently cheaper. Hidden costs add up, and at scale it’s often pricier than containers — best for sporadic, not steady workloads.

David Iyanu Jonathan — DZone

Humans aren’t fast enough for 4 9’s

With just under 4.5 minutes of leeway for outages per month, you have to rely on automated remediation. AI can help, but it’s not a full solution, per this article.

Norberto Lopes — incident.io

blog dds: 2026-05-23 — Why reviewing AI-generated code is devilishly hard

LLMs are specifically designed to generate plausible-seeming output, and this makes reviewing especially difficult.

Diomidis Spinellis

The 28-Hour Meltdown: What Happened When AWS US-EAST-1 Overheated

A breakdown of the 28-hour aws us-east-1 outage in may 2026. What caused it, what went down, and what it means for how you design your infrastructure.

Alon Shrestha

Why Teamwork Makes (Or Breaks) Your Incident Response

This article has a list of common problems in incident response, and I feel like printing it and taping it to my wall.

Karan Nagarajagowda — Uptime Labs

SRE Weekly Issue #518

lex

May 24, 2026

General

Comments

View on sreweekly.com

When AI SRE Fails: Production Reality, Failure Modes, and What They Cost

This article gives you the failure data, cost data, and risk picture you need to make an accurate decision about AI SRE adoption.

James A. Wondrasek — softwareseni

DORA metrics are lying to you and AI is making it worse

The blind spot isn’t delivery, its legibility: DORA measures work flowing through the pipe, not whether anyone can explain what’s in it.

Paul LaPosta — LeadDev

Monitoring reliably at scale

But what happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage.

Abdurrahman J. Allawala — Airbnb

The Pulse: AI load breaks GitHub – why not other vendors?

A thoughtful analysis of GitHub’s availability trouble of late, including some excellent reporting work to get more details on a growth graph previously shared by GitHub.

Gergely Orosz — The Pragmatic Engineer

Flipping the bozo bit on flips the learning off

Here’s a good one introducing the concept of distancing through differencing.

By focusing on the differences, they see no lessons for their own operation and practices.

Lorin Hochstein

You’ve Got (Too Much) Mail: Behind the Scenes of the 3/25/26 Voice Outage

In this post, we’ll peek behind the curtain and see how one seemingly innocuous change overwhelmed a system multiple hops away and how our not-fun afternoon helped us improve Discord.

Discord

Incident Report: May 19, 2026- GCP Account Suspension

Oof. GCP suspended their account “as part of an automated action”, killing production.

This may sound familiar, because GCP did something very similar almost exactly 2 years ago.

Chandrika Khanduri & Cody De Arkland — Railway

Gemini 3.5 deleted 28,745 lines, broke production for 33 minutes, and wrote itself a fake post-mortem claiming credit for the fix

What a story! They discovered that they had inadvertently installed a quite harmful agent ruleset. Before you dismiss this by thinking “I’d never do that”, go back up and read Lorin Hochstein’s article above.

u/dvrkstar — r/bard (Reddit)

SRE Weekly Issue #522

SRE Weekly Issue #521

SRE Weekly Issue #520

SRE Weekly Issue #519

SRE Weekly Issue #518

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, Bronto:

A message from our sponsor, Bronto:

A message from our sponsor, BigPanda:

A message from our sponsor, BigPanda:

A message from our sponsor, BigPanda:

Subscribe

RSS

Mastodon

Search Issues