ML systems decay gradually instead of breaking suddenly, so we need error budgets for model accuracy, data freshness, and fairness — not just uptime.
Varun Kumar Reddy Gajjala — DZone
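To make the idea concrete, here is a minimal sketch of an error budget applied to model accuracy instead of uptime. It is an illustration only, not the article's method: the function name `budget_remaining`, the accuracy threshold, and the SLO target are all assumptions chosen for the example.

```python
def budget_remaining(accuracy_checks, threshold, slo_target):
    """Return how many failed accuracy checks the budget can still absorb.

    accuracy_checks: measured accuracy for each evaluation run in the window
    threshold:       accuracy below this counts as a failed check (assumed value)
    slo_target:      fraction of checks that must pass, e.g. 0.8 (assumed value)
    """
    failed = sum(1 for accuracy in accuracy_checks if accuracy < threshold)
    # The budget is the share of checks the SLO allows to fail.
    allowed = (1 - slo_target) * len(accuracy_checks)
    # A negative result means the budget is exhausted.
    return allowed - failed


# Hypothetical week of evaluation runs: one run dips below the threshold.
checks = [0.93, 0.92, 0.91, 0.88, 0.92, 0.93, 0.94, 0.91, 0.92, 0.93]
remaining = budget_remaining(checks, threshold=0.90, slo_target=0.8)
print(f"failed checks still affordable: {remaining:.1f}")
```

The same shape could track data freshness or a fairness metric by swapping what counts as a failed check.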
Enterprises rarely fail because they don’t care about reliability.
They fail because:
- failure is loud,
- prevention is quiet,
- and budgeting systems are wired to respond to noise.
Florian Hoeppner
They had hundreds of databases to migrate, so they built a tested, self-service migration workflow.
Ram Srivatsa Kannan, Wale Akintayo, Jay Bharadwaj, John Crimmins, Shengwei Wang, and Zhitao Zhu — Netflix
I love the technical description of socket juggling to achieve a graceful restart. I could swear that this technique has been around for decades though, for example in TinyMUX et al…
Manuel Olguín Muñoz — Cloudflare
Lorin goes into what an AI incident manager might look like, since no tools of the sort exist yet.
Lorin Hochstein
By default, Kubernetes keeps events for only about an hour (the API server's default event TTL). This article argues that what we really need is the ability to know the state of the system at a specific time.
Shamsher Khan — DZone
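As a rough illustration of what "state at a specific time" can mean, the sketch below replays a retained log of changes up to a chosen moment. It is a simplified assumption, not the article's design and not a Kubernetes API: the `(timestamp, object, status)` tuple layout and the `state_at` function are hypothetical.

```python
from datetime import datetime


def state_at(changes, when):
    """Replay recorded changes up to `when`; return object -> last known status.

    changes: iterable of (timestamp, object_name, status) tuples, where a
             status of None marks the object's deletion (assumed layout).
    """
    state = {}
    for ts, obj, status in sorted(changes, key=lambda change: change[0]):
        if ts > when:
            break  # everything after this point happened later than `when`
        if status is None:
            state.pop(obj, None)
        else:
            state[obj] = status
    return state


# Hypothetical change log, deliberately out of order.
log = [
    (datetime(2024, 1, 1, 1), "pod/a", "Running"),
    (datetime(2024, 1, 1, 3), "pod/a", "CrashLoopBackOff"),
    (datetime(2024, 1, 1, 2), "pod/b", "Pending"),
    (datetime(2024, 1, 1, 4), "pod/b", None),
]
print(state_at(log, datetime(2024, 1, 1, 2)))
```

This only works if changes are stored for longer than the default event retention, which is the gap the blurb describes.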
They built a platform for safely rolling out configuration changes. I like that it has a special mode for use in incident response.
Cosmo W. Q — Airbnb
This is a cool debugging story, and I love the emphasis on mental models. The bit about simulating different paths through the software is quite intriguing.
Michael Victor Zink — Readyset (via Antithesis)
