SRE Weekly Issue #516

Not all index scans are equal: How we cut query latency by over 99%

Just ensuring your query hits an index isn’t enough — it has to be using it well.

Nenad Noveljic and Bowen Chen — Datadog

AI in SRE: What’s Actually Coming in 2026

A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.

This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.

Ashly Joseph and Jithu Paulose — DZone

Superficial Blamelessness

It’s not about avoiding naming names.

Be wary of a post-incident process that successfully avoids retribution yet is still biased towards an individualistic stance instead of a systemic one.

Fred Hebert — Resilience in Software Foundation

I Don’t Care if AI Wrote the Code. You Own It.

I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.

Peter Farago — RunLLM

Claude-powered AI coding agent deletes entire company database in 9 seconds — backups zapped, after Cursor tool powered by Anthropic’s Claude goes rogue

The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.

Mark Tyson — Tom’s Hardware

Finding zombies in our systems: A real-world story of CPU bottlenecks

A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.

Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest

Ten things not to worry about regarding oncall

It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.

Jos Visser

What AI Incident Response Leaves Behind

The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.

[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.

Stuart Rimell — Uptime Labs

The normal work of creating reliability

Springing off from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety-I and Safety-II approaches.

Lorin Hochstein
