SRE Weekly Issue #520

A message from our sponsor, BigPanda:

Your team solved this incident last month. Why is it back?

Because you fixed the symptom, not the cause. BigPanda surfaces the pattern behind repeat incidents and tells you what to fix so the next on-call doesn’t fight the same P1.

Prevent incidents proactively

We build our systems against the usage patterns of human users, but agents fundamentally change the game.

  Vineet Bhatkoti — DZone

This is an interesting lens for exploring the risks that agents can introduce.

  Sayali Patil — VentureBeat

Great discussion in the comments! There’s a lot of variance in how much time people recommend. I personally tend to lean earlier — on-call is a great way to learn, and I can always reach out if I get stuck.

  u/modern_medicine_isnt and commenters — Reddit r/sre

A great intro to the concept of metastable failures, and I recommend reading the original paper as well.

  Teiva Harsanyi

The real issue is that your company has made declaring an incident costly and risky for the person who does it.

  Brent Chapman

I enjoyed learning about their deliberate architectural choice to keep their central service in a single AZ. This incident highlighted a need for a fast failover plan.

  Coinbase

I like the balance between ensuring 99.99% reliability and designing their product to encourage customers to use their platform in a way that effectively manages the 0.01% case.

Reliability is a customer experience problem

  Mike Fisher — incident.io

I’m not gonna spoil this one for you by writing a summary. Just read it, trust me.

  Lorin Hochstein

SRE Weekly Issue #519

A message from our sponsor, BigPanda:

What if you could predict which changes will cause incidents?

BigPanda analyzes every change, including ones marked safe, to surface the real risk and impact before deployment. Next time, routine changes don’t become your next P1.

See BigPanda for SREs

They give solid examples to argue that much of the learning happens during the process of writing a post-incident review.

[…] you could throw the post-incident review document away after writing it and still get the vast majority of the value out of the process.

  Brent Chapman

I really like this idea of change absorption capacity.

  Priya Gopalsamy — Stack Overflow

A useful guide that covers strategies for benchmarking, along with pitfalls to avoid.

  Ben Dicken — PlanetScale

Serverless isn’t inherently cheaper. Hidden costs add up, and at scale it’s often pricier than containers — best for sporadic, not steady workloads.

  David Iyanu Jonathan — DZone

With just under 4.5 minutes of leeway for outages per month, you have to rely on automated remediation. AI can help, but it’s not a full solution, per this article.

  Norberto Lopes — incident.io
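
For a sense of where a figure like that comes from, here is a minimal sketch of the error-budget arithmetic. It assumes a 99.99% availability target, which the blurb above does not state; the function name and the month lengths are illustrative choices, not taken from the article.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability target.

    slo is a fraction (0.9999 means 99.99%). The 30-day default is an
    assumption for illustration; pass days=31 for a longer month.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - slo)


# Assumed target of 99.99%, not stated in the blurb above.
for days in (30, 31):
    budget = error_budget_minutes(0.9999, days)
    print(f"{days}-day month: {budget:.2f} minutes of downtime allowed")
```

Under these assumptions the budget is about 4.32 minutes for a 30-day month and about 4.46 minutes for a 31-day month, which is in line with the "just under 4.5 minutes" figure.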

LLMs are specifically designed to generate plausible-seeming output, and this makes reviewing especially difficult.

  Diomidis Spinellis

A breakdown of the 28-hour AWS us-east-1 outage in May 2026. What caused it, what went down, and what it means for how you design your infrastructure.

  Alon Shrestha

This article has a list of common problems in incident response, and I feel like printing it and taping it to my wall.

  Karan Nagarajagowda — Uptime Labs

SRE Weekly Issue #518

A message from our sponsor, BigPanda:

When a P1 fires, scope, impact, and cause should be instant.

Instead you’re 10 minutes in, pinging people across tools and teams to understand what’s happening. BigPanda surfaces the full picture the moment an incident starts so you fix, not hunt.

Reduce incident toil

This article gives you the failure data, cost data, and risk picture you need to make an accurate decision about AI SRE adoption.

  James A. Wondrasek — softwareseni

The blind spot isn’t delivery, it’s legibility: DORA measures work flowing through the pipe, not whether anyone can explain what’s in it.

  Paul LaPosta — LeadDev

But what happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage.

  Abdurrahman J. Allawala — Airbnb

A thoughtful analysis of GitHub’s availability trouble of late, including some excellent reporting work to get more details on a growth graph previously shared by GitHub.

  Gergely Orosz — The Pragmatic Engineer

Here’s a good one introducing the concept of distancing through differencing.

By focusing on the differences, they see no lessons for their own operation and practices.

  Lorin Hochstein

In this post, we’ll peek behind the curtain and see how one seemingly innocuous change overwhelmed a system multiple hops away and how our not-fun afternoon helped us improve Discord.

  Discord

Oof. GCP suspended their account “as part of an automated action”, killing production.

This may sound familiar, because GCP did something very similar almost exactly 2 years ago.

  Chandrika Khanduri & Cody De Arkland — Railway

What a story! They discovered that they had inadvertently installed a quite harmful agent ruleset. Before you dismiss this by thinking “I’d never do that”, go back up and read Lorin Hochstein’s article above.

  u/dvrkstar — r/bard (Reddit)

SRE Weekly Issue #517

A message from our sponsor, BigPanda:

No single team sees the full incident anymore.

Today’s P1s break across services, teams, and infrastructure. Instead of chasing dashboards, waiting on tribal knowledge, or piecing together signals from siloed systems, BigPanda surfaces the complete picture to pinpoint root cause faster.

See BigPanda for SREs

There’s some great advice in here. My favorite: be explicit about choosing or not choosing to do something.

  incident.io

Live video delivery is an intensely reliability-critical endeavor, and Netflix pulls back the curtain on how they tackled it.

  Brett Axler, Casper Choffat, and Alo Lowry — Netflix

Java uses memory outside of the heap, so it can OOM in a container even if the heap size is well below the container’s memory limit.

  Ramya vani Rayala — DZone
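
A rough sketch of the arithmetic behind that claim: the heap is only one part of a JVM's memory footprint. Every number below is hypothetical and chosen for illustration, as are the function and parameter names; none of them come from the article, and real off-heap usage varies by workload and JVM settings.

```python
def estimated_footprint_mib(
    heap_mib: int,
    metaspace_mib: int,
    code_cache_mib: int,
    direct_buffers_mib: int,
    threads: int,
    stack_mib_per_thread: int = 1,  # assumed 1 MiB per thread stack
) -> int:
    """Sum the heap plus a few common off-heap regions, in MiB."""
    off_heap = (
        metaspace_mib
        + code_cache_mib
        + direct_buffers_mib
        + threads * stack_mib_per_thread
    )
    return heap_mib + off_heap


# Hypothetical container limit and JVM sizing, for illustration only.
container_limit_mib = 2048
total_mib = estimated_footprint_mib(
    heap_mib=1536,           # heap capped at 75% of the container limit
    metaspace_mib=256,
    code_cache_mib=128,
    direct_buffers_mib=128,
    threads=200,
)
print(f"estimated footprint: {total_mib} MiB, limit: {container_limit_mib} MiB")
print("exceeds limit:", total_mib > container_limit_mib)
```

With these made-up values the heap fits comfortably under the limit, yet the estimated total does not, which is the situation the article describes.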

It’s not about obviously wrong stuff; it’s the queries that look good on the surface that can get you in trouble, per this article. They also share methods to vet LLM-generated SQL.

  Readyset

The mental model we use: AI handles the effort so humans can focus on the insight. Not AI instead of thinking.

  incident.io

[…] because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect.

  Peter Farago — RunLLM

The old way: heroes in incident response are an anti-pattern.
The new way: heroes are great and we should make as many heroes as we can.

  Hamed Silatani — Uptime Labs

I had to read this one twice before I had my galaxy-brain moment in the second-to-last paragraph.

  Lorin Hochstein

SRE Weekly Issue #516

A message from our sponsor, incident.io:

Paging is just 10% of your incident workflow. incident.io’s 4-step framework turns migration into a forcing function for the other 90%: cut alert noise, fix service ownership, and build the on-call program your team actually deserves.

Just ensuring your query hits an index isn’t enough — it has to be using it well.

  Nenad Noveljic and Bowen Chen — Datadog

A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.

This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.

  Ashly Joseph and Jithu Paulose — DZone

It’s not about avoiding naming names.

Be wary of successfully avoiding retribution, yet finding your post-incident process still biased towards an individualistic stance instead of a systemic one.

  Fred Hebert — Resilience in Software Foundation

I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.

  Peter Farago — RunLLM

The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.

  Mark Tyson — Tom’s Hardware

A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.

  Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest

It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.

  Jos Visser

The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.

[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.

  Stuart Rimell — Uptime Labs

Springing off from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety I and II approaches.

  Lorin Hochstein

A production of Tinker Tinker Tinker, LLC