SRE Weekly Issue #516

A message from our sponsor, incident.io:

Paging is just 10% of your incident workflow. incident.io’s 4-step framework turns migration into a forcing function for the other 90%: cut alert noise, fix service ownership, and build the on-call program your team actually deserves.

Ensuring your query hits an index isn’t enough; the query has to use that index well.

  Nenad Noveljic and Bowen Chen — Datadog

A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.

This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.

  Ashly Joseph and Jithu Paulose — DZone

It’s not about avoiding naming names.

Be wary of successfully avoiding retribution, yet finding your post-incident process still biased towards an individualistic stance instead of a systemic one.

  Fred Hebert — Resilience in Software Foundation

I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.

  Peter Farago — RunLLM

The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.

  Mark Tyson — Tom’s Hardware

A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.

  Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest

It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.

  Jos Visser

The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.

[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.

  Stuart Rimell — Uptime Labs

Springing off from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety-I and Safety-II approaches.

  Lorin Hochstein

Updated: May 10, 2026 — 9:39 pm
A production of Tinker Tinker Tinker, LLC