SRE Weekly Issue #503

Abstraction is meant to encapsulate complexity, but when done poorly, it creates opacity—a lack of visibility into what’s actually happening under the hood.

  RoseSecurity

This article uses publicly available incident data and an open source tool to show that MTTR is not under statistical control, making it a useless metric.

  Lorin Hochstein
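
For readers unfamiliar with the term "statistical control", here is a minimal sketch of one common check, an XmR (individuals and moving range) chart. It is not taken from the article, whose tooling and method may differ; the function names and the sample values are hypothetical and do not come from real incident data.

```python
from statistics import mean


def xmr_limits(values):
    """Return (center, lower, upper) natural process limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    # 2.66 is the conventional XmR scaling constant for individual values.
    spread = 2.66 * mean(moving_ranges)
    return center, center - spread, center + spread


def out_of_control(values):
    """Return indices of points that fall outside the natural process limits."""
    _, lower, upper = xmr_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical time-to-resolve values in minutes (made up for illustration).
ttr_minutes = [30, 35, 28, 32, 31, 29, 33, 30, 240, 31]
print(out_of_control(ttr_minutes))
```

With these made-up values, only the 240-minute incident falls outside the limits. A series that keeps producing such points is not under statistical control, and its average is a poor summary of the process.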

Why should we trust an AI SRE Agent? This article describes a kind of agent that shows its sources and provides more detail when asked.

Presumably these folks are saying their agent meets this description, but this isn’t (directly) a marketing piece (except for the last 2 sentences).

  RunLLM

The idea here is targeted load shedding, terminating tasks that are the likely cause of overload, using efficient heuristics.

  Murat Demirbas — summary

  YIGONG HU, ZEYIN ZHANG, YICHENG LIU, YILE GU, SHUANGYU LEI, and BARIS KASIKCI — original paper
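
As a rough illustration of the general idea of targeted load shedding, the sketch below cancels the costliest tasks first until the load fits within capacity. It is not the paper's algorithm: the `Task` type, the `cost` estimate, and the greedy rule are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cost: float  # hypothetical estimate of the load this task contributes


def pick_tasks_to_cancel(tasks, capacity):
    """Greedily pick the costliest tasks until the remaining load fits capacity."""
    load = sum(t.cost for t in tasks)
    victims = []
    # Consider the heaviest contributors first, since they are the most
    # likely cause of the overload under this simple heuristic.
    for task in sorted(tasks, key=lambda t: t.cost, reverse=True):
        if load <= capacity:
            break
        victims.append(task)
        load -= task.cost
    return victims


# Hypothetical workload: one expensive task dominates the load.
tasks = [Task("small-read", 1.0), Task("full-scan", 50.0), Task("small-write", 2.0)]
print([t.name for t in pick_tasks_to_cancel(tasks, capacity=10.0)])
```

The contrast is with untargeted shedding, which would drop incoming requests indiscriminately. In this example, cancelling the single dominant task is enough to leave the two small ones running.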

Part 2 is just as good as the first, and I highly recommend reading it — along with the original Ironies of Automation paper.

  Uwe Friedrichsen

Take a deep technical dive into GitLab.com’s deployment pipeline, including progressive rollouts, canary strategies, database migrations, and multiversion compatibility.

  John Skarbek — GitLab

A fun debugging story with an unexpected resolution, plus a discussion of broader lessons learned.

  Liam Mackie — Octopus Deploy

A review of AWS’s talk on their incident, covering the new details AWS shared and some key insights from the author.

  Lorin Hochstein

Cloudflare discusses what they’re doing in response to their recent high-profile outages. They’re moving toward applying more structure and rigor to configuration deployments, like they already have for code deployments.

  Dane Knecht — Cloudflare

Updated: December 28, 2025 — 9:46 pm
A production of Tinker Tinker Tinker, LLC