SRE Weekly Issue #503

The Abstraction Debt in Infrastructure as Code

Abstraction is meant to encapsulate complexity, but when done poorly, it creates opacity—a lack of visibility into what’s actually happening under the hood.

RoseSecurity

Fun with incident data and statistical process control

This article uses publicly available incident data and an open source tool to show that MTTR is not under statistical control, making it a useless metric.

Lorin Hochstein
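
For readers new to the term, "under statistical control" can be illustrated with an XmR (individuals and moving range) chart, one common statistical process control tool. The sketch below is only an illustration: the article's own tool and data are not reproduced here, the function names are hypothetical, and the sample durations are made up. It computes the natural process limits as the mean plus or minus 2.66 times the average moving range, then flags points outside them.

```python
from statistics import mean


def xmr_limits(values):
    """Return (center, lower, upper) natural process limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    center = mean(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    # 2.66 is the standard XmR scaling constant for the average moving range.
    spread = 2.66 * mean(moving_ranges)
    # Assumption: durations cannot be negative, so the lower limit is clamped at 0.
    return center, max(0.0, center - spread), center + spread


def out_of_control(values):
    """Return (index, value) pairs that fall outside the natural process limits."""
    _, lower, upper = xmr_limits(values)
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Made-up incident durations in minutes, for illustration only.
durations = [30, 45, 38, 52, 41, 36, 400, 44]
flagged = out_of_control(durations)
```

Points outside the limits suggest the process is not predictable, which is the sense in which a metric summarizing it can mislead.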

The Glass Box AI SRE

Why should we trust an AI SRE Agent? This article describes a kind of agent that shows its sources and provides more detail when asked.

Presumably these folks are saying their agent meets this description, but this isn’t (directly) a marketing piece (except for the last 2 sentences).

RunLLM

Mitigating Application Resource Overload with Targeted Task Cancellation

The idea here is targeted load shedding: terminating the tasks most likely to be causing the overload, identified using efficient heuristics.

Murat Demirbas — summary

Yigong Hu, Zeyin Zhang, Yicheng Liu, Yile Gu, Shuangyu Lei, and Baris Kasikci — original paper

AI and the ironies of automation – Part 2

Part 2 is just as good as the first, and I highly recommend reading it — along with the original Ironies of Automation paper.

Uwe Friedrichsen

Deploying the world’s largest GitLab instance 12 times daily

Take a deep technical dive into GitLab.com’s deployment pipeline, including progressive rollouts, canary strategies, database migrations, and multiversion compatibility.

John Skarbek — GitLab

It works on my cluster: a tale of two troubleshooters

A fun debugging story with an unexpected resolution, plus a discussion of broader lessons learned.

Liam Mackie — Octopus Deploy

AWS re:Invent talk on their Oct ’25 incident

A review of AWS’s talk on their incident, covering the new details AWS shared and some key insights from the author.

Lorin Hochstein

Code Orange: Fail Small — our resilience plan following recent incidents

Cloudflare discusses what they’re doing in response to their recent high-profile outages. They’re moving toward applying more structure and rigor to configuration deployments, as they already do for code deployments.

Dane Knecht — Cloudflare
