SRE Weekly Issue #510

Building SRE Error Budgets for AI/ML Workloads: A Practical Framework

ML systems decay gradually instead of breaking suddenly, so we need error budgets for model accuracy, data freshness, and fairness — not just uptime.

Varun Kumar Reddy Gajjala — DZone

Why Enterprises Overfund Failure and Underfund Prevention

Enterprises rarely fail because they don’t care about reliability. They fail because failure is loud, prevention is quiet, and budgeting systems are wired to respond to noise.

Florian Hoeppner

Automating RDS Postgres to Aurora Postgres Migration

They had hundreds of databases to migrate, so they built a tested, self-service migration workflow.

Ram Srivasta Kannan, Wale Akintayo, Jay Bharadwaj, John Crimmins, Shengwei Wang, and Zhitao Zhu — Netflix

Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare

I love the technical description of socket juggling to achieve a graceful restart. I could swear that this technique has been around for decades though, for example in TinyMUX et al…

Manuel Olguín Muñoz — Cloudflare

Lots of AI SRE, no AI incident management

Lorin goes into what an AI incident manager might look like, since no tools of the sort exist yet.

Lorin Hochstein

When Kubernetes Forgets: The 90-Second Evidence Gap

By default, Kubernetes keeps a pretty short event history. This article argues that what we really need is the ability to know the state of the system at a specific time.

Shamsher Khan — DZone

Safeguarding dynamic configuration changes at scale

They built a platform for safely rolling out configuration changes. I like that it has a special mode for use in incident response.

Cosmo W. Q — Airbnb

Catching a caching bug at Readyset

This is a cool debugging story, and I love the emphasis on mental models. The bit about simulating different paths through the software is quite intriguing.

Michael Victor Zink — Readyset (via Antithesis)
