Abstraction is meant to encapsulate complexity, but when done poorly, it creates opacity—a lack of visibility into what’s actually happening under the hood.
RoseSecurity
This article uses publicly available incident data and an open source tool to show that MTTR is not under statistical control, making it a useless metric.
Lorin Hochstein
Why should we trust an AI SRE Agent? This article describes a kind of agent that shows its sources and provides more detail when asked.
Presumably these folks are saying their agent meets this description, but this isn’t (directly) a marketing piece (except for the last 2 sentences).
RunLLM
The idea here is targeted load shedding, terminating tasks that are the likely cause of overload, using efficient heuristics.
Murat Demirbas — summary
Yigong Hu, Zeyin Zhang, Yicheng Liu, Yile Gu, Shuangyu Lei, and Baris Kasikci — original paper
Part 2 is just as good as the first, and I highly recommend reading it — along with the original Ironies of Automation paper.
Uwe Friedrichsen
Take a deep technical dive into GitLab.com’s deployment pipeline, including progressive rollouts, canary strategies, database migrations, and multiversion compatibility.
John Skarbek — GitLab
A fun debugging story with an unexpected resolution, plus a discussion of broader lessons learned.
Liam Mackie — Octopus Deploy
A review of AWS’s talk on their incident, with info about what new detail AWS shared and some key insights from the author.
Lorin Hochstein
Cloudflare discusses what they’re doing in response to their recent high-profile outages. They’re moving toward applying more structure and rigor to configuration deployments, like they already have for code deployments.
Dane Knecht — Cloudflare
