SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #518

lex

May 24, 2026

When AI SRE Fails: Production Reality, Failure Modes, and What They Cost

This article gives you the failure data, cost data, and risk picture you need to make an accurate decision about AI SRE adoption.

James A. Wondrasek — softwareseni

DORA metrics are lying to you and AI is making it worse

The blind spot isn’t delivery, it’s legibility: DORA measures work flowing through the pipe, not whether anyone can explain what’s in it.

Paul LaPosta — LeadDev

Monitoring reliably at scale

But what happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage.

Abdurrahman J. Allawala — Airbnb

The Pulse: AI load breaks GitHub – why not other vendors?

A thoughtful analysis of GitHub’s availability trouble of late, including some excellent reporting work to get more details on a growth graph previously shared by GitHub.

Gergely Orosz — The Pragmatic Engineer

Flipping the bozo bit on flips the learning off

Here’s a good one introducing the concept of distancing through differencing.

By focusing on the differences, they see no lessons for their own operation and practices.

Lorin Hochstein

You’ve Got (Too Much) Mail: Behind the Scenes of the 3/25/26 Voice Outage

In this post, we’ll peek behind the curtain and see how one seemingly innocuous change overwhelmed a system multiple hops away and how our not-fun afternoon helped us improve Discord.

Discord

Incident Report: May 19, 2026 - GCP Account Suspension

Oof. GCP suspended their account “as part of an automated action”, killing production.

This may sound familiar, because GCP did something very similar almost exactly 2 years ago.

Chandrika Khanduri & Cody De Arkland — Railway

Gemini 3.5 deleted 28,745 lines, broke production for 33 minutes, and wrote itself a fake post-mortem claiming credit for the fix

What a story! They discovered that they had inadvertently installed a quite harmful agent ruleset. Before you dismiss this by thinking “I’d never do that”, go back up and read Lorin Hochstein’s article above.

u/dvrkstar — r/bard (Reddit)

SRE Weekly Issue #517

lex

May 17, 2026

Why post-mortem action items die

There’s some great advice in here. My favorite: be explicit about choosing or not choosing to do something.

incident.io

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Live video delivery is an intensely reliability-critical endeavor, and Netflix pulls back the curtain on how they tackled it.

Brett Axler, Casper Choffat, and Alo Lowry — Netflix

The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes

Java uses memory outside of the heap, so it can OOM in a container even if the heap size is well below the container’s memory limit.

Ramya vani Rayala — DZone
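A back-of-the-envelope sketch of the point in the Java OOMKill item above: the container limit applies to the whole process, not just the heap. The `JvmBudget` class and every figure below are illustrative assumptions, not numbers from the linked article.

```python
from dataclasses import dataclass


@dataclass
class JvmBudget:
    """Hypothetical memory budget for one JVM process, in MiB."""
    heap_max_mib: int          # e.g. the value passed to -Xmx
    metaspace_mib: int         # class metadata, lives outside the heap
    threads: int               # number of live threads
    stack_mib_per_thread: int  # per-thread stack size
    direct_buffers_mib: int    # off-heap NIO buffers
    code_cache_mib: int        # JIT-compiled code
    other_native_mib: int      # GC structures, allocator overhead, etc.

    def total_mib(self) -> int:
        # The kernel accounts for all of this against the container limit.
        return (
            self.heap_max_mib
            + self.metaspace_mib
            + self.threads * self.stack_mib_per_thread
            + self.direct_buffers_mib
            + self.code_cache_mib
            + self.other_native_mib
        )


def headroom_mib(budget: JvmBudget, container_limit_mib: int) -> int:
    """Positive means the budget fits; negative means an OOMKill risk."""
    return container_limit_mib - budget.total_mib()


# Assumed example: heap set to 75% of a 2 GiB container limit.
budget = JvmBudget(
    heap_max_mib=1536,
    metaspace_mib=256,
    threads=200,
    stack_mib_per_thread=1,
    direct_buffers_mib=128,
    code_cache_mib=240,
    other_native_mib=64,
)
print(budget.total_mib(), headroom_mib(budget, 2048))
```

With these assumed figures the heap alone fits comfortably, yet the worst-case total exceeds the limit by a few hundred MiB.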

Why LLMs Write Incorrect SQL (and What That Means for Your Database)

It’s not about obviously wrong stuff; it’s the queries that look good on the surface that can get you in trouble, per this article. They also share methods to vet LLM-generated SQL.

Readyset

What does using AI for post-mortems actually mean?

The mental model we use: AI handles the effort so humans can focus on the insight. Not AI instead of thinking.

incident.io

The Code Nobody Read Is Already in Production

[…] because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect.

Peter Farago — RunLLM

The Incident Hero Trap

The old way: heroes in incident response are an anti-pattern.
The new way: heroes are great and we should make as many heroes as we can.

Hamed Silatani — Uptime Labs

How incidents can teach us about what’s already working well

I had to read this one twice before I had my galaxy-brain moment in the second-to-last paragraph.

Lorin Hochstein

SRE Weekly Issue #516

lex

May 10, 2026

Not all index scans are equal: How we cut query latency by over 99%

Just ensuring your query hits an index isn’t enough — it has to be using it well.

Nenad Noveljic and Bowen Chen — Datadog

AI in SRE: What’s Actually Coming in 2026

A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.

This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.

Ashly Joseph and Jithu Paulose — DZone

Superficial Blamelessness

It’s not about avoiding naming names.

Be wary of successfully avoiding retribution, yet finding your post-incident process still biased towards an individualistic stance instead of a systemic one.

Fred Hebert — Resilience in Software Foundation

I Don’t Care if AI Wrote the Code. You Own It.

I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.

Peter Farago — RunLLM

Claude-powered AI coding agent deletes entire company database in 9 seconds — backups zapped, after Cursor tool powered by Anthropic’s Claude goes rogue

The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.

Mark Tyson — Tom’s Hardware

Finding zombies in our systems: A real-world story of CPU bottlenecks

A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.

Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest

Ten things not to worry about regarding oncall

It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.

Jos Visser

What AI Incident Response Leaves Behind

The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.

[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.

Stuart Rimell — Uptime Labs

The normal work of creating reliability

Springing off from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety I and II approaches.

Lorin Hochstein

SRE Weekly Issue #515

lex

May 3, 2026

The Silent Failure of Reliability Metrics at Scale: Lessons Learned from a Decade of Broken Metrics

Why Reliability Metrics Age Faster Than the Systems They Measure

Is your dashboard always green because everything is working, or because your metrics are lying?

Barnadeep Bhowmik — Stackademic

When upserts don’t update but still write: Debugging Postgres performance at scale

But when we rolled out the new query, disk writes doubled and Write-Ahead Logging (WAL) syncs quadrupled. We discovered that even when an upsert doesn’t change any values, it still locks the conflicting row, which is recorded in the WAL.

Yikes! Click through to learn how they figured it out and what they did about it.

Anthonin Bonnefoy — Datadog

Incidents *Will* Happen. Are You (Actually) Prepared?

it’s important not just to try to prevent incidents but to be fully ready for them when they inevitably happen anyway.

Joe Mckevitt — Uptime Labs

Why Queues Don’t Fix Scaling Problems

Queues absorb spikes but not sustained overload. Without backpressure, limits, and monitoring, backlogs grow until systems fail.

David Iyanu Jonathan — DZone
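A toy simulation of the claim in the queues item above: a queue drains after a short spike, but under sustained overload the backlog only stops growing if something pushes back. The function name, tick model, and all rates are assumptions made up for illustration.

```python
from typing import List, Optional, Tuple


def simulate_queue(
    arrivals: List[int],
    service_rate: int,
    max_depth: Optional[int] = None,
) -> Tuple[List[int], int]:
    """Simulate a queue in discrete ticks.

    Each tick: new items arrive, anything beyond max_depth is rejected
    (a crude stand-in for backpressure), then up to service_rate items
    are processed. Returns the backlog after each tick and the total
    number of rejected items.
    """
    backlog = 0
    rejected = 0
    history = []
    for arriving in arrivals:
        backlog += arriving
        if max_depth is not None and backlog > max_depth:
            rejected += backlog - max_depth
            backlog = max_depth
        backlog = max(0, backlog - service_rate)
        history.append(backlog)
    return history, rejected


# A one-tick spike: the queue absorbs it and drains.
spike, _ = simulate_queue([10, 10, 50, 10, 10, 10, 10, 10], service_rate=20)

# Sustained overload, unbounded queue: backlog grows every tick.
overload, _ = simulate_queue([25] * 8, service_rate=20)

# Same overload with a depth limit: backlog stays flat, excess is shed.
bounded, shed = simulate_queue([25] * 8, service_rate=20, max_depth=30)

print(spike, overload, bounded, shed)
```

The bounded case trades a growing backlog for explicit rejections, which is the part you then need to monitor.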

April 2026 Outage Post-Mortem

Oof. The code exhausted all ephemeral ports and then logged itself to death complaining about it. I love the workaround. Loopback is a /8!

Jim Calabro — Bluesky

Thoughts on the Bluesky public incident write-up

…and here’s an awesome analysis and explanation of the Bluesky writeup. I’ve definitely been down the path of scratching my head about EADDRINUSE before.

Lorin Hochstein

How We Reduced Median Memory Estimation Error by 99%, With the Help of AI

AI didn’t solve the problem for them, but it sped up the grunt-work and significantly reduced their iteration time, letting them get to an answer much faster.

Tristan Streichenberger — Mixpanel

An update on recent Claude Code quality reports

It’s interesting to me that this is essentially an outage/degradation report, but the definition of system degradation for an LLM tool is much more subjective than with traditional services. The “ablation testing” concept is really neat.

Anthropic

SRE Weekly Issue #514

lex

April 26, 2026

How we built a real-world evaluation platform for autonomous SRE agents at scale

Finally! Someone actually explaining how they test their SRE agent. Having a testing methodology is table stakes. Showing their work helps us decide whether we can trust the tool.

With so many SRE agents floating around, it’s quite surprising to me that this kind of article is so rare.

Benjamin Barton — Datadog

Behind the scenes: How Database Traffic Control works

An enlightening deep dive into the way this Postgres resource management system evaluates the cost of queries in order to shed resource-intensive ones.

Patrick Reynolds — PlanetScale

Why Security Incidents Feel Different from Outages

If you’ve ever been in an incident where communication suddenly went quiet and access got restricted, this article explains why. The author breaks down five fundamental ways security incident response diverges from outage response — and why the instincts that make you effective at one can actively work against you in the other.

Art Kondratiev — Uptime Labs

Reliability Is Security: Why SRE Teams Are Becoming the Frontline of Cloud Defense

Security and reliability are inextricably intertwined. Examples: reliability failures leave security temporarily weak and vulnerable, and security changes have caused a number of recent high-profile outages.

Oreoluwa Omoike — DZone

Kubernetes Autoscaling: What Breaks Under Real Traffic

Some timely reminders about the realities of how autoscaling actually works in Kubernetes. It’s all about tuning your mental model.

Ankush Madaan — DZone

The Myth of Horizontal Scalability

There’s a limit to how far parallelism can get you, and it’s down to what part of your workload is by necessity serial.

[…] in practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story.

David Iyanu Jonathan — DZone
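The limit described in the horizontal scalability item above is usually expressed as Amdahl's law; I'm assuming that is the model the article has in mind. A minimal sketch, with the 5% serial fraction chosen purely for illustration:

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup over one worker when only the parallel part scales.

    Amdahl's law: 1 / (s + (1 - s) / n), where s is the fraction of
    the workload that must run serially and n is the worker count.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def max_speedup(serial_fraction: float) -> float:
    """Upper bound on speedup as the worker count grows without limit."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in (0, 1]")
    return 1.0 / serial_fraction


# Assumed example: 5% of the work is serial (say, a shared database write).
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.05, n), 2))
print("ceiling:", max_speedup(0.05))
```

With a 5% serial share, going from 100 to 1000 workers buys very little, because the ceiling is 20x no matter how many workers you add.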

How Our gRPC Services Collapsed During Traffic Bursts — and What Finally Stopped It

This is a great story, and I really liked the section on why traditional reliability techniques (autoscaling, circuit breakers, and rate limits) weren’t enough.

Parveen Saini — DZone
