SRE Weekly Issue #519

A message from our sponsor, BigPanda:

What if you could predict which changes will cause incidents?

BigPanda analyzes every change, including ones marked safe, to surface the real risk and impact before deployment. Next time, routine changes don’t become your next P1.

See BigPanda for SREs

They give solid examples to argue that much of the learning happens during the process of writing a post-incident review.

[…] you could throw the post-incident review document away after writing it and still get the vast majority of the value out of the process.

  Brent Chapman

I really like this idea of change absorption capacity.

  Priya Gopalsamy — Stack Overflow

A useful guide that covers strategies for benchmarking, along with pitfalls to avoid.

  Ben Dicken — PlanetScale

Serverless isn’t inherently cheaper. Hidden costs add up, and at scale it’s often pricier than containers — best for sporadic, not steady workloads.

   David Iyanu Jonathan — DZone

With just under 4.5 minutes of leeway for outages per month, you have to rely on automated remediation. AI can help, but it’s not a full solution, per this article.

  Norberto Lopes — incident.io
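
For a sense of where a figure like that comes from: a budget of just under 4.5 minutes per month matches a 99.99% availability target. That target is an assumption here, not something stated above. A minimal sketch of the arithmetic, using a hypothetical helper name and a 30-day month:

```python
def monthly_downtime_budget_minutes(availability: float, days_in_month: int = 30) -> float:
    """Minutes of allowed downtime per month for a given availability target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # The 99.99% row is the one assumed to correspond to the blurb above.
    for target in (0.999, 0.9999, 0.99999):
        budget = monthly_downtime_budget_minutes(target)
        print(f"{target:.3%} -> {budget:.2f} minutes/month")
```

At that scale a single human page-and-respond cycle can consume the whole month's budget, which is the case for automated remediation.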

LLMs are specifically designed to generate plausible-seeming output, and this makes reviewing especially difficult.

  Diomidis Spinellis

A breakdown of the 28-hour AWS us-east-1 outage in May 2026. What caused it, what went down, and what it means for how you design your infrastructure.

  Alon Shrestha

This article has a list of common problems in incident response, and I feel like printing it and taping it to my wall.

  Karan Nagarajagowda — Uptime Labs

SRE Weekly Issue #518

A message from our sponsor, BigPanda:

When a P1 fires, scope, impact, and cause should be instant.

Instead you’re 10 minutes in, pinging people across tools and teams to understand what’s happening. BigPanda surfaces the full picture the moment an incident starts so you fix, not hunt.

Reduce incident toil

This article gives you the failure data, cost data, and risk picture you need to make an accurate decision about AI SRE adoption.

  James A. Wondrasek — softwareseni

The blind spot isn’t delivery, it’s legibility: DORA measures work flowing through the pipe, not whether anyone can explain what’s in it.

  Paul LaPosta — LeadDev

But what happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage.

  Abdurrahman J. Allawala — Airbnb

A thoughtful analysis of GitHub’s availability trouble of late, including some excellent reporting work to get more details on a growth graph previously shared by GitHub.

  Gergely Orosz — The Pragmatic Engineer

Here’s a good one introducing the concept of distancing through differencing.

By focusing on the differences, they see no lessons for their own operation and practices.

  Lorin Hochstein

In this post, we’ll peek behind the curtain and see how one seemingly innocuous change overwhelmed a system multiple hops away and how our not-fun afternoon helped us improve Discord.

  Discord

Oof. GCP suspended their account “as part of an automated action”, killing production.

This may sound familiar, because GCP did something very similar almost exactly 2 years ago.

  Chandrika Khanduri & Cody De Arkland — Railway

What a story! They discovered that they had inadvertently installed a quite harmful agent ruleset. Before you dismiss this by thinking “I’d never do that”, go back up and read Lorin Hochstein’s article above.

  u/dvrkstar — r/bard (Reddit)

SRE Weekly Issue #517

A message from our sponsor, BigPanda:

No single team sees the full incident anymore.

Today’s P1s break across services, teams, and infrastructure. Instead of chasing dashboards, waiting on tribal knowledge, or piecing together signals from siloed systems, BigPanda surfaces the complete picture to pinpoint root cause faster.

See BigPanda for SREs

There’s some great advice in here. My favorite: be explicit about choosing or not choosing to do something.

  incident.io

Live video delivery is an intensely reliability-critical endeavor, and Netflix pulls back the curtain on how they tackled it.

  Brett Axler, Casper Choffat, and Alo Lowry — Netflix

Java uses memory outside of the heap, so it can OOM in a container even if the heap size is well below the container’s memory limit.

   Ramya vani Rayala — DZone
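
To make the arithmetic concrete, here is a rough sketch of how non-heap memory can push a JVM past its container limit even when the heap fits. Every number and name below is a made-up illustration, not a measurement from the article, and the categories simplify what a real JVM allocates.

```python
from dataclasses import dataclass


@dataclass
class JvmFootprint:
    """Hypothetical per-process memory estimate; all sizes in MiB."""
    heap_max: int           # maximum heap, e.g. what -Xmx would set
    metaspace: int          # class metadata, stored outside the heap
    thread_count: int       # number of live threads
    thread_stack: int       # stack size per thread
    direct_buffers: int     # off-heap NIO buffers
    code_cache_and_gc: int  # JIT code cache plus GC bookkeeping

    def total(self) -> int:
        """Sum heap and non-heap usage into one process-level estimate."""
        return (
            self.heap_max
            + self.metaspace
            + self.thread_count * self.thread_stack
            + self.direct_buffers
            + self.code_cache_and_gc
        )


def fits_in_container(footprint: JvmFootprint, container_limit: int) -> bool:
    """True if the estimated total stays within the container memory limit."""
    return footprint.total() <= container_limit


if __name__ == "__main__":
    # Illustrative values only: the heap is well under the limit, the total is not.
    app = JvmFootprint(
        heap_max=1536,
        metaspace=200,
        thread_count=300,
        thread_stack=1,
        direct_buffers=256,
        code_cache_and_gc=300,
    )
    limit = 2048
    print(f"heap={app.heap_max} MiB, total={app.total()} MiB, limit={limit} MiB")
    print("fits" if fits_in_container(app, limit) else "would be OOM-killed")
```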

It’s not about obviously wrong stuff; it’s the queries that look good on the surface that can get you in trouble, per this article. They also share methods to vet LLM-generated SQL.

  Readyset

The mental model we use: AI handles the effort so humans can focus on the insight. Not AI instead of thinking.

  incident.io

[…] because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect.

  Peter Farago — RunLLM

The old way: heroes in incident response are an anti-pattern.
The new way: heroes are great and we should make as many heroes as we can.

  Hamed Silatani — Uptime Labs

I had to read this one twice before I had my galaxy-brain moment in the second-to-last paragraph.

  Lorin Hochstein

SRE Weekly Issue #516

A message from our sponsor, incident.io:

Paging is just 10% of your incident workflow. incident.io’s 4-step framework turns migration into a forcing function for the other 90%: cut alert noise, fix service ownership, and build the on-call program your team actually deserves.

Just ensuring your query hits an index isn’t enough — it has to be using it well.

  Nenad Noveljic and Bowen Chen — Datadog

A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.

This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.

   Ashly Joseph and Jithu Paulose — DZone

It’s not about avoiding naming names.

Be wary of successfully avoiding retribution, yet finding your post-incident process still biased towards an individualistic stance instead of a systemic one.

  Fred Hebert — Resilience in Software Foundation

I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.

  Peter Farago — RunLLM

The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.

  Mark Tyson — Tom’s Hardware

A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.

  Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest

It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.

  Jos Visser

The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.

[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.

  Stuart Rimell — Uptime Labs

Springing off from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety I and II approaches.

  Lorin Hochstein

SRE Weekly Issue #515

A message from our sponsor, atscaleconference.com:

Building scalable, high-performance infrastructure for AI is one of today’s toughest challenges. Join @Scale: Systems & Reliability on June 25 in Bellevue, WA to learn how leading engineers are solving it.

Secure your seat today!

Why Reliability Metrics Age Faster Than the Systems They Measure

Is your dashboard always green because everything is working, or because your metrics are lying?

  Barnadeep Bhowmik — Stackademic

But when we rolled out the new query, disk writes doubled and Write-Ahead Logging (WAL) syncs quadrupled. We discovered that even when an upsert doesn’t change any values, it still locks the conflicting row, which is recorded in the WAL.

Yikes! Click through to learn how they figured it out and what they did about it.

  Anthonin Bonnefoy — Datadog

It’s important not just to try to prevent incidents but to be fully ready for them when they inevitably happen anyway.

  Joe Mckevitt — Uptime Labs

Queues absorb spikes but not sustained overload. Without backpressure, limits, and monitoring, backlogs grow until systems fail.

   David Iyanu Jonathan — DZone
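
The spike-versus-sustained-overload distinction is easy to see in a toy model. The sketch below is a discrete-time simulation with made-up rates and a hypothetical function name, not code from the article. An optional depth limit stands in for backpressure by rejecting work once the queue is full.

```python
def simulate_backlog(arrivals, service_rate, max_depth=None):
    """Simulate a queue tick by tick.

    arrivals: number of items arriving in each tick
    service_rate: items the consumer can process per tick
    max_depth: if set, arrivals that would exceed this depth are rejected

    Returns (depth_after_each_tick, total_rejected).
    """
    depth = 0
    rejected = 0
    history = []
    for arriving in arrivals:
        depth += arriving
        if max_depth is not None and depth > max_depth:
            # Backpressure: shed the excess instead of letting the backlog grow.
            rejected += depth - max_depth
            depth = max_depth
        depth = max(0, depth - service_rate)
        history.append(depth)
    return history, rejected


if __name__ == "__main__":
    spike = [5] * 3 + [40] + [5] * 8   # one burst, then back to normal
    overload = [12] * 100              # steadily above the service rate

    print("spike, final depth:", simulate_backlog(spike, 10)[0][-1])
    print("overload, unbounded final depth:", simulate_backlog(overload, 10)[0][-1])
    print("overload, bounded (depth, rejected):",
          simulate_backlog(overload, 10, max_depth=50)[0][-1],
          simulate_backlog(overload, 10, max_depth=50)[1])
```

With the limit in place the backlog stays flat and the overload shows up as a rejection count, which is something you can alert on before the system falls over.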

Oof. The code exhausted all ephemeral ports and then logged itself to death complaining about it. I love the workaround. Loopback is a /8!

  Jim Calabro — Bluesky

…and here’s an awesome analysis and explanation of the Bluesky writeup. I’ve definitely been down the path of scratching my head about EADDRINUSE before.

  Lorin Hochstein
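
For the ephemeral port exhaustion in the two items above, here is the arithmetic behind the "Loopback is a /8" remark. This is a sketch, not a reproduction of the writeup's workaround. It assumes the default Linux ephemeral port range, and the helper names are hypothetical. The idea is that connections to one destination are limited by the ports available per address, so spreading across more loopback addresses raises the ceiling.

```python
import ipaddress

# Default Linux net.ipv4.ip_local_port_range; real hosts may be tuned differently.
EPHEMERAL_LOW, EPHEMERAL_HIGH = 32768, 60999

# The whole 127.0.0.0/8 block is loopback, not just 127.0.0.1.
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")


def ephemeral_ports(low: int = EPHEMERAL_LOW, high: int = EPHEMERAL_HIGH) -> int:
    """Number of ports in an inclusive ephemeral range."""
    return high - low + 1


def max_connections(address_count: int, ports_per_address: int) -> int:
    """Rough upper bound on concurrent connections when spread over several addresses."""
    return address_count * ports_per_address


if __name__ == "__main__":
    ports = ephemeral_ports()
    print("loopback addresses available:", LOOPBACK.num_addresses)
    print("ephemeral ports per address:", ports)
    for n in (1, 4, 16):
        print(f"{n} loopback address(es): up to {max_connections(n, ports)} connections")
```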

AI didn’t solve the problem for them, but it sped up the grunt-work and significantly reduced their iteration time, letting them get to an answer much faster.

  Tristan Streichenberger — Mixpanel

It’s interesting to me that this is essentially an outage/degradation report, but the definition of system degradation for an LLM tool is much more subjective than with traditional services. The “ablation testing” concept is really neat.

  Anthropic

A production of Tinker Tinker Tinker, LLC