SRE Weekly Issue #515

A message from our sponsor, atscaleconference.com:

Building scalable, high-performance infrastructure for AI is one of today’s toughest challenges. Join @Scale: Systems & Reliability on June 25 in Bellevue, WA to learn how leading engineers are solving it.

Secure your seat today!

Why Reliability Metrics Age Faster Than the Systems They Measure

Is your dashboard always green because everything is working, or because your metrics are lying?

  Barnadeep Bhowmik — Stackademic

But when we rolled out the new query, disk writes doubled and Write-Ahead Logging (WAL) syncs quadrupled. We discovered that even when an upsert doesn’t change any values, it still locks the conflicting row, which is recorded in the WAL.

Yikes! Click through to learn how they figured it out and what they did about it.

  Anthonin Bonnefoy — Datadog
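
To make the behavior concrete, here's a minimal psycopg2 sketch of the pattern described above (table, column, and value names are invented for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Even when (name, value) already match the stored row, Postgres
        # must lock the conflicting row to evaluate the conflict, and that
        # lock is recorded in the WAL -- so a "no-op" upsert still costs
        # disk writes and WAL syncs.
        cur.execute(
            """
            INSERT INTO settings (name, value)
            VALUES (%s, %s)
            ON CONFLICT (name) DO UPDATE SET value = EXCLUDED.value
            """,
            ("feature_flag", "on"),
        )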

it’s important not just to try to prevent incidents but to be fully ready for them when they inevitably happen anyway.

  Joe Mckevitt — Uptime Labs

Queues absorb spikes but not sustained overload. Without backpressure, limits, and monitoring, backlogs grow until systems fail.

  David Iyanu Jonathan — DZone
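
A minimal sketch of the bounded-queue idea (mine, not the article's): cap the backlog so sustained overload becomes an explicit signal instead of silent growth.

    import queue

    # An unbounded queue absorbs sustained overload silently until memory
    # runs out; a maxsize turns overload into an immediate, visible signal.
    work = queue.Queue(maxsize=1000)

    def submit(job) -> bool:
        try:
            work.put_nowait(job)  # reject rather than buffer indefinitely
            return True
        except queue.Full:
            # Backpressure: the caller must slow down, shed load, or retry
            # later -- and counting these rejections gives monitoring a
            # direct view of sustained overload.
            return False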

Oof. The code exhausted all ephemeral ports and then logged itself to death complaining about it. I love the workaround. Loopback is a /8!

  Jim Calabro — Bluesky
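
For the curious, the workaround relies on the fact that ephemeral ports are allocated per local address, and on Linux the whole of 127.0.0.0/8 routes to loopback. A hedged sketch of the idea (not the actual fix from the writeup):

    import random
    import socket

    def loopback_connect(port: int) -> socket.socket:
        # Pick a random source address from the ~16M in 127.0.0.0/8,
        # multiplying the usable ephemeral port space by the same factor.
        src = "127.%d.%d.%d" % (
            random.randint(0, 255), random.randint(0, 255), random.randint(1, 254)
        )
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((src, 0))  # port 0: let the kernel pick an ephemeral port
        s.connect(("127.0.0.1", port))
        return s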

…and here’s an awesome analysis and explanation of the Bluesky writeup. I’ve definitely been down the path of scratching my head about EADDRINUSE before.

  Lorin Hochstein

AI didn’t solve the problem for them, but it sped up the grunt-work and significantly reduced their iteration time, letting them get to an answer much faster.

  Tristan Streichenberger — Mixpanel

It’s interesting to me that this is essentially an outage/degradation report, but the definition of system degradation for an LLM tool is much more subjective than with traditional services. The “ablation testing” concept is really neat.

  Anthropic

SRE Weekly Issue #514

Finally! Someone actually explaining how they test their SRE agent. Having a testing methodology is table stakes. Showing their work helps us decide whether we can trust the tool.

With so many SRE agents floating around, it’s quite surprising to me that this kind of article is so rare.

  Benjamin Barton — Datadog

An enlightening deep dive into the way this Postgres resource management system evaluates the cost of queries to shed resource-intensive ones.

  Patrick Reynolds — PlanetScale
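
The general shape of cost-based shedding, as a generic sketch of my own (the linked post covers the real system's cost model; all names here are invented):

    def admit(estimated_cost: float, load: float, budget: float = 100.0) -> bool:
        # As load approaches saturation, the affordable cost shrinks, so the
        # most expensive queries are shed first while cheap ones keep flowing.
        return estimated_cost <= budget * (1.0 - load)

    # At 90% load, only queries costing <= 10 units get through:
    admit(estimated_cost=8.0, load=0.9)   # True
    admit(estimated_cost=50.0, load=0.9)  # False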

If you’ve ever been in an incident where communication suddenly went quiet and access got restricted, this article explains why. The author breaks down five fundamental ways security incident response diverges from outage response — and why the instincts that make you effective at one can actively work against you in the other.

  Art Kondratiev — Uptime Labs

Security and reliability are inextricably intertwined. Examples: reliability failures leave security temporarily weak and vulnerable, and security changes have caused a number of recent high-profile outages.

  Oreoluwa Omoike — DZone

Some timely reminders about the realities of how autoscaling actually works in Kubernetes. It’s all about tuning your mental model.

  Ankush Madaan — DZone

There’s a limit to how far parallelism can get you, and it’s down to what part of your workload is by necessity serial.

[…] in practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story.

  David Iyanu Jonathan — DZone
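
The classic formalization of that limit is Amdahl's law: if a fraction s of the work is inherently serial, no number of workers can push speedup past 1/s. A quick illustration:

    def amdahl_speedup(serial_fraction: float, workers: int) -> float:
        # Speedup = 1 / (s + (1 - s) / n); as n grows, this approaches 1/s.
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

    # Even with only 5% of the work serial, the ceiling is 20x:
    print(amdahl_speedup(0.05, 10))    # ~6.9x
    print(amdahl_speedup(0.05, 1000))  # ~19.6x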

This is a great story, and I really liked the section on why traditional reliability techniques (autoscaling, circuit breakers, and rate limits) weren’t enough.

  Parveen Saini — DZone

SRE Weekly Issue #513

A message from our sponsor, incident.io:

“Lifting and shifting” noise to new tools just buys a different UI for the same burnout. incident.io’s migration framework prioritizes service cataloging and inventory to fix ownership, preventing team friction during the transition to a scalable on-call system.

A previously unpublished article by the late Dr. Richard Cook!

Organizational Second Hit Syndrome is an incident-related phenomenon analogous to neurological second-impact-syndrome (SIS). It occurs when a major incident creates a vulnerable period during which a second incident generates strong, widespread, and sometimes destructive organizational reactions.

  John Allspaw and Dr. Richard I. Cook — Adaptive Capacity Labs

Over 20k mounts to run 100 containers! And NUMA issues too. This one really drives home the fact that SREs need to be cognizant of all layers of the stack.

  Harshad Sane and Andrew Halaney — Netflix

Cost explosion is a reliability problem. I love the idea of surfacing a sudden cost increase as an alert that something is probably going wrong.

  David Iyanu Jonathan — DZone
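
One simple shape that idea could take (my sketch, not the article's): compare today's spend against a trailing baseline and alert on the ratio.

    from statistics import mean

    def cost_spiked(daily_costs: list[float], threshold: float = 2.0) -> bool:
        # Trailing 7-day average, excluding today (the last element).
        baseline = mean(daily_costs[-8:-1])
        return daily_costs[-1] > threshold * baseline

    # A runaway job more than doubles spend overnight, so the alert fires:
    cost_spiked([100, 102, 98, 101, 99, 103, 100, 240])  # True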

Autoscaling is reactive, not resilient. Without caps, metrics, or overrides, it can worsen failures. True elasticity requires policy, testing, and bottleneck awareness.

Raise your hand if your system has ever autoscaled itself to death. ✋

  David Iyanu Jonathan — DZone
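
A toy sketch of why caps matter (mine, not the article's), built on the proportional rule the Kubernetes HPA uses, desired = ceil(current * observed / target):

    import math

    MAX_REPLICAS = 50  # hypothetical policy cap

    def desired_replicas(current: int, utilization: float, target: float = 0.6) -> int:
        desired = math.ceil(current * utilization / target)
        # Without this clamp, anything that pins utilization at 100% (a retry
        # storm, a hot shard) keeps inflating the fleet, and the new
        # instances' cold starts can make the failure worse.
        return min(desired, MAX_REPLICAS)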

Heinrich Hartmann argues AI’s most valuable role in SRE isn’t autonomous remediation. It’s making sure on-call engineers have the context to fix incidents fast.

  Peter Farago — RunLLM

As usual, I enjoy reading Lorin’s analysis of GitHub’s writeup on their incidents just as much as the writeup itself, if not more. Saturation, a security mechanism causing an outage, and more.

  Lorin Hochstein

Airbnb made a big move, migrating to a new observability stack. They explain how they structured the project to deliver a big win as early as possible, building buy-in.

  Callum Jones — Airbnb

Each one of these is like a pile of War Stories all gathered up into a tidy package we can learn from.

  Karan Nagarajagowda — Uptime Labs

SRE Weekly Issue #512

A message from our sponsor, Archera:

AI workloads are unpredictable, which makes cloud commitments feel like a gamble. Archera insures your commitments against underutilization, so you can push coverage higher without the risk of getting stuck. If usage drops, Archera covers the downside. Commitment Release Guarantee included.

Start Saving

Improving robustness requires increasing complexity. Let’s throw more complexity at it?

I’m using this enormously complex system, an LLM, to help me solve a problem that was created by software complexity in the first place.

  Lorin Hochstein

This feels like using multiple agents as a sort of redundancy and cross-validation architecture to improve the reliability of agent output.

  Alex Ewerlöf

This article explains why end-to-end testing breaks down in microservice-based systems, not due to poor tooling, but because of fundamental architectural and operational mismatches.

  Alok Kumar — DZone

LaunchDarkly’s survey data have some interesting things to say about the impact of AI.

[…] while build and deployment velocity have improved, production reliability has not.

  LaunchDarkly

Fred Hebert surveyed how AI coding assistants vs. AI SRE tools are marketed and found a stark divide: coding assistants are framed as partners that augment engineers, while AI SREs are framed as replacements for low-value work. The implication is that the people building and buying these tools see incident response as grunt work to be automated away — and that says a lot about how decision-makers perceive the role.

  Fred Hebert

I especially like the point that incidents are leadership moments — how you respond tells your team everything about the culture you’re building. This one is aimed at CTOs, but really it’s a great reminder for anyone in a leadership role during incidents.

  Joe Mckevitt — Uptime Labs

There’s a really interesting bit in this one about libraries and layers of the system doing their own retries without your knowledge, magnifying retry volume.

  David Iyanu Jonathan — DZone
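
The amplification compounds per layer rather than adding, which is why it's so easy to underestimate. A quick illustration with made-up layer counts:

    # Attempts multiply across layers: one user-visible call can fan out
    # into application x HTTP client x service mesh retries.
    layers = {
        "application": 3,   # your explicit retry loop
        "http_client": 3,   # a library default you may not know about
        "service_mesh": 2,  # a sidecar retry policy
    }

    attempts = 1
    for retries in layers.values():
        attempts *= retries
    print(attempts)  # 18 backend requests, worst case, for one logical call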

I like the section on what AI should and shouldn’t do. It’s important to avoid automating away the process of learning from incidents.

  incident.io

SRE Weekly Issue #511

A message from our sponsor, Depot:

CI was designed for humans who context-switch while waiting. Agents don’t. They’re just blocked. Depot CEO Kyle Galbraith on how they re-imagined Depot CI to close the loop: run against local patches, rerun a single job, SSH into the runner to check reality. Per-second billing, no minimums.

Run depot ci migrate

This one’s definitely going to be good to keep in mind during my next incident.

FYI for folks with no or low vision, there’s a screenshot of J. Paul Reed quoting Vanessa Huerta Granda: “Incidents are where engineers are made”.

  Stuart Rimell — Uptime Labs

Etsy migrated a 1,000-table DB with 1,000 shards (with their own custom ORM!) over to Vitess, and it took some care, especially in how they handled transactions.

  Ella Yarmo-Gray — Etsy

Wow, this one sure hits hard.

  Kenneth Eversole

The section on lessons learned toward the end of this debugging story is a goldmine.

  Lokesh Soni

How do you ensure reliability in a system you can’t access? How can you monitor SLIs/SLOs without metrics?

  Alex Ewerlöf

I love a good debugging story, and this one delivers, with a confluence of gnarly problems and lessons we can all learn from.

  James Sawyer — Phantom Tide

Oof, what a nasty little gotcha in the API call at the heart of this incident.

  David Tuber and Dzevad Trumic — Cloudflare

Lorin’s Law strikes again!

System intended to improve reliability contributed to incident

  Lorin Hochstein

A production of Tinker Tinker Tinker, LLC