Ensuring your query hits an index isn’t enough: the query also has to use that index well.
Nenad Noveljic and Bowen Chen — Datadog
A practical look at where AI genuinely helps SRE teams, and what “AI-powered operations” can realistically deliver in production.
This one’s balanced: some optimism and excitement with a healthy dose of skepticism and caution.
Ashly Joseph and Jithu Paulose — DZone
It’s not about avoiding naming names.
Be wary of successfully avoiding retribution while your post-incident process stays biased towards an individualistic stance instead of a systemic one.
Fred Hebert — Resilience in Software Foundation
I love that this article takes the AI-and-code-ownership conversation all the way to production. It’s not enough to review what the AI wrote — if you’re not also the one carrying the pager for it, the accountability loop falls apart.
Peter Farago — RunLLM
The confluence of agent failure with Railway’s behavior of deleting all backups makes this one especially noteworthy.
Mark Tyson — Tom’s Hardware
A fun debugging story with a noteworthy cause. I’m gonna be keeping a closer eye on cgroups.
Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi, Charles-A. Francisco, Karthik Anantha Padmanabhan, and David Westbrook — Pinterest
It’s gonna be okay, really! If you’re going on-call for the first time, read this one. For the thousandth time? You should read it too.
Jos Visser
The Left-Over Principle: what’s left for humans to do when you’ve automated everything possible.
[…] each advance in AI incident response will render increasingly complex scenarios ‘Left-Over’ to human intelligence, which itself will be less and less prepared to deal with them.
Stuart Rimell — Uptime Labs
Springing from a LinkedIn comment by John Allspaw, this one goes into the differences between the Safety-I and Safety-II approaches.
Lorin Hochstein
