SRE Weekly Issue #516

A message from our sponsor, incident.io:

Paging is just 10% of your incident workflow. incident.io’s 4-step framework turns migration into a forcing function for the other 90%: cut alert noise, fix service ownership, and build the on-call program your team actually deserves.

Just ensuring your query hits an index isn’t enough — it has to use it well.
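
Here’s a toy illustration of my own (SQLite for portability, not the Postgres from the article): both queries below hit an index, but the first walks the entire index while the second seeks straight to the matching entries.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (status TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    (("error" if i % 1000 == 0 else "ok", i) for i in range(500_000)),
)

def timed(label, query, args):
    start = time.perf_counter()
    rows = conn.execute(query, args).fetchall()
    print(f"{label}: {len(rows)} rows in {time.perf_counter() - start:.4f}s")

# An index led by the low-selectivity range column: the query "hits" it,
# but the scan still walks all 500k index entries, filtering on status.
conn.execute("CREATE INDEX idx_ts_status ON events (ts, status)")
timed("hit, used poorly", "SELECT ts FROM events WHERE ts >= ? AND status = ?", (0, "error"))

# An index led by the selective equality column: a tight seek over ~500 entries.
conn.execute("CREATE INDEX idx_status_ts ON events (status, ts)")
timed("hit, used well", "SELECT ts FROM events WHERE status = ? AND ts >= ?", ("error", 0))
```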

  Nenad Noveljic and Bowen Chen — Datadog

A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.

This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.

   Ashly Joseph and Jithu Paulose — DZone

It’s not about avoiding naming names.

Be wary: you can successfully avoid retribution and still find your post-incident process biased towards an individualistic stance instead of a systemic one.

  Fred Hebert — Resilience in Software Foundation

I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.

  Peter Farago — RunLLM

The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.

  Mark Tyson — Tom’s Hardware

A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.

  Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest

It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.

  Jos Visser

The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.

[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.

  Stuart Rimell — Uptime Labs

Springing off from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety-I and Safety-II approaches.

  Lorin Hochstein

SRE Weekly Issue #515

A message from our sponsor, atscaleconference.com:

Building scalable, high-performance infrastructure for AI is one of today’s toughest challenges. Join @Scale: Systems & Reliability on June 25 in Bellevue, WA to learn how leading engineers are solving it.

Secure your seat today!

Why Reliability Metrics Age Faster Than the Systems They Measure

Is your dashboard always green because everything is working, or because your metrics are lying?

  Barnadeep Bhowmik — Stackademic

But when we rolled out the new query, disk writes doubled and Write-Ahead Logging (WAL) syncs quadrupled. We discovered that even when an upsert doesn’t change any values, it still locks the conflicting row, which is recorded in the WAL.

Yikes! Click through to learn how they figured it out and what they did about it.
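
I won’t spoil their solution, but one common mitigation for no-op upserts (not necessarily the one they chose) is a guarded ON CONFLICT clause. Table and column names below are made up; note the guard skips the redundant UPDATE and its WAL record, though the conflicting row can still be locked while the clause is evaluated.

```python
# Hypothetical schema; a common mitigation pattern, not necessarily
# Datadog's fix. Requires: pip install psycopg2-binary.
import psycopg2

UPSERT = """
INSERT INTO metrics (key, value)
VALUES (%s, %s)
ON CONFLICT (key) DO UPDATE
    SET value = EXCLUDED.value
    -- The guard: only perform the UPDATE when something actually changes,
    -- avoiding a pointless new row version and its WAL write.
    WHERE metrics.value IS DISTINCT FROM EXCLUDED.value
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(UPSERT, ("requests_total", 42))
```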

  Anthonin Bonnefoy — Datadog

it’s important not just to try to prevent incidents but to be fully ready for them when they inevitably happen anyway.

  Joe Mckevitt — Uptime Labs

Queues absorb spikes but not sustained overload. Without backpressure, limits, and monitoring, backlogs grow until systems fail.
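
A minimal sketch of that idea (mine, not the article’s): a bounded queue plus an explicit rejection path turns sustained overload into fast, countable failures instead of a silently growing backlog.

```python
import queue

work = queue.Queue(maxsize=1000)  # an explicit cap on the backlog
rejected = 0                      # a number worth alerting on

def submit(job) -> bool:
    """Absorb short spikes up to the cap; shed load beyond it."""
    global rejected
    try:
        work.put_nowait(job)
        return True
    except queue.Full:
        # Backpressure: fail fast (e.g., surface a 429 upstream) instead of
        # letting the backlog grow until the system falls over.
        rejected += 1
        return False
```

Watching work.qsize() alongside the rejection count is what turns “sustained overload” into a signal rather than a surprise.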

   David Iyanu Jonathan — DZone

Oof. The code exhausted all ephemeral ports and then logged itself to death complaining about it. I love the workaround. Loopback is a /8!
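
For the curious, here’s my sketch of the trick that punchline hints at (assuming Linux, where the entire 127.0.0.0/8 range answers on loopback by default; other OSes may need aliases configured). The uniqueness constraint is on the whole (source IP, source port, destination IP, destination port) tuple, so varying the source address multiplies the available port space.

```python
import random
import socket

def connect_loopback(port: int) -> socket.socket:
    """Connect over loopback from a random 127/8 source address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Each distinct source address gets its own pool of ephemeral ports
    # (roughly 28k by default on Linux), so the 4-tuple space stops running dry.
    src = f"127.{random.randint(0, 255)}.{random.randint(0, 255)}.{random.randint(1, 254)}"
    s.bind((src, 0))  # port 0: let the kernel pick an ephemeral port
    s.connect(("127.0.0.1", port))
    return s
```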

  Jim Calabro — Bluesky

…and here’s an awesome analysis and explanation of the Bluesky writeup. I’ve definitely been down the path of scratching my head about EADDRINUSE before.

  Lorin Hochstein

AI didn’t solve the problem for them, but it sped up the grunt-work and significantly reduced their iteration time, letting them get to an answer much faster.

  Tristan Streichenberger — Mixpanel

It’s interesting to me that this is essentially an outage/degradation report, but the definition of system degradation for an LLM tool is much more subjective than with traditional services. The “ablation testing” concept is really neat.

  Anthropic

SRE Weekly Issue #514

Finally! Someone actually explaining how they test their SRE agent. Having a testing methodology is table stakes. Showing their work helps us decide whether we can trust the tool.

With so many SRE agents floating around, it’s quite surprising to me that this kind of article is so rare.

  Benjamin Barton — Datadog

An enlightening deep dive into the way this Postgres resource management system evaluates the cost of queries in order to shed resource-intensive ones.

  Patrick Reynolds — PlanetScale

If you’ve ever been in an incident where communication suddenly went quiet and access got restricted, this article explains why. The author breaks down five fundamental ways security incident response diverges from outage response — and why the instincts that make you effective at one can actively work against you in the other.

  Art Kondratiev — Uptime Labs

Security and reliability are inextricably intertwined. Examples: reliability failures leave security temporarily weak and vulnerable, and security changes have caused a number of recent high-profile outages.

  Oreoluwa Omoike — DZone

Some timely reminders about the realities of how autoscaling actually works in Kubernetes. It’s all about tuning your mental model.

   Ankush Madaan — DZone

There’s a limit to how far parallelism can get you, and it comes down to the part of your workload that is necessarily serial.

[…] in practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story.
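
The classic formulation of that limit is Amdahl’s law; here’s a quick back-of-the-envelope in Python (my addition, not from the article):

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup given the fraction of work that is serial."""
    return 1 / (serial_fraction + (1 - serial_fraction) / workers)

# Even with only 10% serial work, 100 workers get you ~9.2x, not 100x,
# and the ceiling is 10x no matter how many workers you add.
print(amdahl_speedup(0.10, 100))  # ≈ 9.17
```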

   David Iyanu Jonathan — DZone

This is a great story, and I really liked the section on why traditional reliability techniques (autoscaling, circuit breakers, and rate limits) weren’t enough.

  Parveen Saini — DZone

SRE Weekly Issue #513

A message from our sponsor, incident.io:

“Lifting and shifting” noise to new tools just buys a different UI for the same burnout. incident.io’s migration framework prioritizes service cataloging and inventory to fix ownership, preventing team friction during the transition to a scalable on-call system.

A previously unpublished article by the late Dr. Richard Cook!

Organizational Second Hit Syndrome is an incident-related phenomenon analogous to neurological second-impact-syndrome (SIS). It occurs when a major incident creates a vulnerable period during which a second incident generates strong, widespread, and sometimes destructive organizational reactions.

  John Allspaw and Dr. Richard I. Cook — Adaptive Capacity Labs

Over 20k mounts to run 100 containers! And NUMA issues too. This one really drives home the fact that SREs need to be cognizant of all layers of the stack.

  Harshad Sane and Andrew Halaney — Netflix

Cost explosion is a reliability problem. I love the idea of surfacing sudden cost increase as an alert that something is probably going wrong.

   David Iyanu Jonathan — DZone

Autoscaling is reactive, not resilient. Without caps, metrics, or overrides, it can worsen failures. True elasticity requires policy, testing, and bottleneck awareness.

Raise your hand if your system has ever autoscaled itself to death. ✋
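
A toy sketch of the caps-and-overrides point (mine, not the article’s): clamp every scaling decision with hard bounds and a human escape hatch, so a feedback loop can’t run away in either direction.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScalePolicy:
    min_replicas: int = 2
    max_replicas: int = 50          # hard cap: a runaway feedback loop stops here
    override: Optional[int] = None  # human escape hatch during an incident

def desired_replicas(current: int, util: float, target: float, p: ScalePolicy) -> int:
    if p.override is not None:
        return p.override
    # Same shape as the Kubernetes HPA rule: scale proportionally to how far
    # the observed metric is from its target.
    want = math.ceil(current * util / target)
    return min(max(want, p.min_replicas), p.max_replicas)

# Ten replicas at 4.2 observed utilization against a 0.6 target "wants" 70;
# the cap clamps that to 50, and an operator can pin it lower mid-incident.
print(desired_replicas(10, util=4.2, target=0.6, p=ScalePolicy()))             # 50
print(desired_replicas(10, util=4.2, target=0.6, p=ScalePolicy(override=20)))  # 20
```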

   David Iyanu Jonathan — DZone

Heinrich Hartmann argues AI’s most valuable role in SRE isn’t autonomous remediation. It’s making sure on-call engineers have the context to fix incidents fast.

  Peter Farago — RunLLM

As usual, I enjoy reading Lorin’s analysis of GitHub’s writeup on their incidents just as much as the writeup itself, if not more. Saturation, a security mechanism causing an outage, and more.

  Lorin Hochstein

Airbnb made a big move, migrating to a new observability stack. They explain how they structured the project to deliver a big win as early as possible, building buy-in.

  Callum Jones — Airbnb

Each one of these is like a pile of War Stories all gathered up into a tidy package we can learn from.

  Karan Nagarajagowda — Uptime Labs

SRE Weekly Issue #512

A message from our sponsor, Archera:

AI workloads are unpredictable, which makes cloud commitments feel like a gamble. Archera insures your commitments against underutilization, so you can push coverage higher without the risk of getting stuck. If usage drops, Archera covers the downside. Commitment Release Guarantee included.

Start Saving

Improving robustness requires increasing complexity. Let’s throw more complexity at it?

I’m using this enormously complex system, an LLM, to help me solve a problem that was created by software complexity in the first place.

  Lorin Hochstein

This feels like using multiple agents as a sort of redundancy and cross-validation architecture to improve the reliability of agent output.

  Alex Ewerlöf

This article explains why end-to-end testing breaks down in microservice-based systems, not due to poor tooling, but because of fundamental architectural and operational mismatches.

   Alok Kumar — DZone

LaunchDarkly’s survey data have some interesting things to say about the impact of AI.

[…] while build and deployment velocity have improved, production reliability has not.

  LaunchDarkly

Fred Hebert surveyed how AI coding assistants vs. AI SRE tools are marketed and found a stark divide: coding assistants are framed as partners that augment engineers, while AI SREs are framed as replacements for low-value work. The implication is that the people building and buying these tools see incident response as grunt work to be automated away — and that says a lot about how decision-makers perceive the role.

  Fred Hebert

I especially like the point that incidents are leadership moments — how you respond tells your team everything about the culture you’re building. This one is aimed at CTOs, but really it’s a great reminder for anyone in a leadership role during incidents.

  Joe Mckevitt — Uptime Labs

There’s a really interesting bit in this one about libraries and layers of the system doing their own retries without your knowledge, magnifying retry volume.
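
A toy model of that magnification (my sketch, not the article’s): stack two innocent-looking retry layers and one logical request becomes nine backend calls.

```python
import functools

def retry(times):
    """Naive retry decorator; each wrapped layer multiplies attempts."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
        return wrapper
    return deco

calls = 0

@retry(3)  # your own retry policy
@retry(3)  # a hidden retry inside a library you depend on
def backend_call():
    global calls
    calls += 1
    raise TimeoutError("backend overloaded")

try:
    backend_call()
except TimeoutError:
    pass
print(calls)  # 9 -- one logical request became nine hits on the backend
```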

   David Iyanu Jonathan — DZone

I like the section on what AI should and shouldn’t do. It’s important to avoid automating away the process of learning from incidents.

  incident.io

A production of Tinker Tinker Tinker, LLC