SRE Weekly Issue #518

A message from our sponsor, BigPanda:

When a P1 fires, scope, impact, and cause should be instant.

Instead you’re 10 minutes in, pinging people across tools and teams to understand what’s happening. BigPanda surfaces the full picture the moment an incident starts so you fix, not hunt.

Reduce incident toil

This article gives you the failure data, cost data, and risk picture you need to make an accurate decision about AI SRE adoption.

  James A. Wondrasek — softwareseni

The blind spot isn’t delivery, it’s legibility: DORA measures work flowing through the pipe, not whether anyone can explain what’s in it.

  Paul LaPosta — LeadDev

But what happens when your observability stack is dependent on the same systems that are failing? In that moment, the dashboards go dark, alerts stop firing, and the tools meant to guide recovery become part of the outage.

  Abdurrahman J. Allawala — Airbnb

A thoughtful analysis of GitHub’s availability trouble of late, including some excellent reporting work to get more details on a growth graph previously shared by GitHub.

  Gergely Orosz — The Pragmatic Engineer

Here’s a good one introducing the concept of distancing through differencing.

By focusing on the differences, they see no lessons for their own operation and practices.

  Lorin Hochstein

In this post, we’ll peek behind the curtain and see how one seemingly innocuous change overwhelmed a system multiple hops away and how our not-fun afternoon helped us improve Discord.

  Discord

Oof. GCP suspended their account “as part of an automated action”, killing production.

This may sound familiar, because GCP did something very similar almost exactly 2 years ago.

  Chandrika Khanduri & Cody De Arkland — Railway

What a story! They discovered that they had inadvertently installed a quite harmful agent ruleset. Before you dismiss this by thinking “I’d never do that”, go back up and read Lorin Hochstein’s article above.

  u/dvrkstar — r/bard (Reddit)

SRE Weekly Issue #517

A message from our sponsor, BigPanda:

No single team sees the full incident anymore.

Today’s P1s break across services, teams, and infrastructure. Instead of chasing dashboards, waiting on tribal knowledge, or piecing together signals from siloed systems, BigPanda surfaces the complete picture to pinpoint root cause faster.

See BigPanda for SREs

There’s some great advice in here. My favorite: be explicit about choosing or not choosing to do something.

  incident.io

Live video delivery is an intensely reliability-critical endeavor, and Netflix pulls back the curtain on how they tackled it.

  Brett Axler, Casper Choffat, and Alo Lowry — Netflix

Java uses memory outside of the heap, so it can OOM in a container even if the heap size is well below the container’s memory limit.

  Ramya vani Rayala — DZone

It’s not about obviously wrong stuff; it’s the queries that look good on the surface that can get you in trouble, per this article. They also share methods to vet LLM-generated SQL.

  Readyset

The mental model we use: AI handles the effort so humans can focus on the insight. Not AI instead of thinking.

  incident.io

[…] because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect.

  Peter Farago — RunLLM

The old way: heroes in incident response are an anti-pattern.
The new way: heroes are great and we should make as many heroes as we can.

  Hamed Silatani — Uptime Labs

I had to read this one twice before I had my galaxy-brain moment in the second-to-last paragraph.

  Lorin Hochstein

SRE Weekly Issue #516

A message from our sponsor, incident.io:

Paging is just 10% of your incident workflow. incident.io’s 4-step framework turns migration into a forcing function for the other 90%: cut alert noise, fix service ownership, and build the on-call program your team actually deserves.

Just ensuring your query hits an index isn’t enough — it has to be using it well.

  Nenad Noveljic and Bowen Chen — Datadog

A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.

This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.

  Ashly Joseph and Jithu Paulose — DZone

It’s not about avoiding naming names.

Be wary of successfully avoiding retribution, yet finding your post-incident process still biased towards an individualistic stance instead of a systemic one.

  Fred Hebert — Resilience in Software Foundation

I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.

  Peter Farago — RunLLM

The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.

  Mark Tyson — Tom’s Hardware

A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.

  Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest

It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.

  Jos Visser

The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.

[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.

  Stuart Rimell — Uptime Labs

Springing off from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety I and II approaches.

  Lorin Hochstein

SRE Weekly Issue #515

A message from our sponsor, atscaleconference.com:

Building scalable, high-performance infrastructure for AI is one of today’s toughest challenges. Join @Scale: Systems & Reliability on June 25 in Bellevue, WA to learn how leading engineers are solving it.

Secure your seat today!

Why Reliability Metrics Age Faster Than the Systems They Measure

Is your dashboard always green because everything is working, or because your metrics are lying?

  Barnadeep Bhowmik — Stackademic

But when we rolled out the new query, disk writes doubled and Write-Ahead Logging (WAL) syncs quadrupled. We discovered that even when an upsert doesn’t change any values, it still locks the conflicting row, which is recorded in the WAL.

Yikes! Click through to learn how they figured it out and what they did about it.

  Anthonin Bonnefoy — Datadog

It’s important not just to try to prevent incidents but to be fully ready for them when they inevitably happen anyway.

  Joe Mckevitt — Uptime Labs

Queues absorb spikes but not sustained overload. Without backpressure, limits, and monitoring, backlogs grow until systems fail.

  David Iyanu Jonathan — DZone
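
To make the spike-versus-sustained-overload distinction concrete, here is a minimal sketch of a bounded queue that rejects work when full. It is not taken from the article; the names `offer` and `simulate` and all of the numbers are made up for illustration, and rejecting at the door is only one of several possible backpressure strategies.

```python
import queue


def offer(q, item):
    """Try to enqueue without blocking; return False if the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # Backpressure: tell the caller "no" instead of growing the backlog.
        return False


def simulate(capacity, arrivals_per_tick, served_per_tick, ticks):
    """Toy producer/consumer loop (hypothetical workload, not real traffic).

    Returns (accepted, rejected, final_backlog).
    """
    q = queue.Queue(maxsize=capacity)
    accepted = rejected = 0
    for _ in range(ticks):
        # Producers offer work first...
        for _ in range(arrivals_per_tick):
            if offer(q, object()):
                accepted += 1
            else:
                rejected += 1
        # ...then consumers drain what they can this tick.
        for _ in range(served_per_tick):
            if q.empty():
                break
            q.get_nowait()
    return accepted, rejected, q.qsize()


if __name__ == "__main__":
    # A short spike fits inside the buffer: nothing is rejected.
    print(simulate(capacity=10, arrivals_per_tick=3, served_per_tick=2, ticks=5))
    # Sustained overload: the backlog stays capped and the excess is shed.
    print(simulate(capacity=10, arrivals_per_tick=3, served_per_tick=2, ticks=100))
```

With these toy numbers the five-tick spike is absorbed entirely, while the hundred-tick run sheds roughly one item per tick once the buffer fills. An unbounded queue would instead keep growing by one item per tick for as long as the overload lasted.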

Oof. The code exhausted all ephemeral ports and then logged itself to death complaining about it. I love the workaround. Loopback is a /8!

  Jim Calabro — Bluesky
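
For a sense of scale on "Loopback is a /8", here is a small sketch of the arithmetic using Python's standard `ipaddress` module. It is not from the Bluesky writeup. The helper `loopback_address` is hypothetical, and the port range is an assumption (the usual Linux default for `net.ipv4.ip_local_port_range`), so check your own hosts before relying on the numbers.

```python
import ipaddress

# The whole 127.0.0.0/8 block is reserved for loopback, not just 127.0.0.1.
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")

# Assumption: the common Linux default ephemeral port range, 32768-60999.
EPHEMERAL_LOW, EPHEMERAL_HIGH = 32768, 60999


def loopback_address(n):
    """Return the n-th address in 127.0.0.0/8 as a string (n=1 is 127.0.0.1).

    Hypothetical helper for illustration; it skips the network address
    (n=0) and the broadcast address at the top of the block.
    """
    if not 0 < n < LOOPBACK.num_addresses - 1:
        raise ValueError("n is outside the usable loopback range")
    return str(LOOPBACK[n])


if __name__ == "__main__":
    ports = EPHEMERAL_HIGH - EPHEMERAL_LOW + 1
    print("addresses in 127.0.0.0/8:", LOOPBACK.num_addresses)
    print("ephemeral ports per source address (assumed range):", ports)
    print("a few alternative loopback addresses:",
          [loopback_address(n) for n in (1, 2, 256)])
```

The reason this matters: a TCP connection is identified by source address, source port, destination address, and destination port, so changing either loopback address opens up a fresh set of source ports. Note that on Linux the whole block is typically usable out of the box, while some other systems need extra loopback aliases configured first.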

…and here’s an awesome analysis and explanation of the Bluesky writeup. I’ve definitely been down the path of scratching my head about EADDRINUSE before.

  Lorin Hochstein

AI didn’t solve the problem for them, but it sped up the grunt-work and significantly reduced their iteration time, letting them get to an answer much faster.

  Tristan Streichenberger — Mixpanel

It’s interesting to me that this is essentially an outage/degradation report, but the definition of system degradation for an LLM tool is much more subjective than with traditional services. The “ablation testing” concept is really neat.

  Anthropic

SRE Weekly Issue #514

Finally! Someone actually explaining how they test their SRE agent. Having a testing methodology is table stakes. Showing their work helps us decide whether we can trust the tool.

With so many SRE agents floating around, it’s quite surprising to me that this kind of article is so rare.

  Benjamin Barton — Datadog

An enlightening deep dive into the way this Postgres resource management system evaluates the cost of queries in order to shed resource-intensive ones.

  Patrick Reynolds — PlanetScale

If you’ve ever been in an incident where communication suddenly went quiet and access got restricted, this article explains why. The author breaks down five fundamental ways security incident response diverges from outage response — and why the instincts that make you effective at one can actively work against you in the other.

  Art Kondratiev — Uptime Labs

Security and reliability are inextricably intertwined. Examples: reliability failures leave security temporarily weak and vulnerable, and security changes have caused a number of recent high-profile outages.

  Oreoluwa Omoike — DZone

Some timely reminders about the realities of how autoscaling actually works in Kubernetes. It’s all about tuning your mental model.

  Ankush Madaan — DZone

There’s a limit to how far parallelism can get you, and it’s down to what part of your workload is by necessity serial.

[…] in practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story.

  David Iyanu Jonathan — DZone
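
The limit described above has the shape of Amdahl's law, and a quick calculation shows how hard the ceiling is. This is a minimal sketch assuming that is the model in play; the function name `amdahl_speedup` and the 10% serial fraction are illustrative, not figures from the article.

```python
def amdahl_speedup(serial_fraction, workers):
    """Ideal speedup under Amdahl's law.

    serial_fraction: share of the work that cannot be parallelized (0.0-1.0).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative workload: 10% of each request is unavoidably serial.
    for workers in (1, 10, 100, 1000):
        print(workers, round(amdahl_speedup(0.10, workers), 2))
```

With a 10% serial share, ten workers give only about a 5x speedup, and no number of workers can reach 10x. Shrinking the serial part (for example, a shared database that every request coordinates on) moves the ceiling; adding workers does not.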

This is a great story, and I really liked the section on why traditional reliability techniques (autoscaling, circuit breakers, and rate limits) weren’t enough.

  Parveen Saini — DZone

A production of Tinker Tinker Tinker, LLC