SRE Weekly Issue #520

AI Agents Expose a Design Gap in Microservices Resilience

We build our systems against the usage patterns of human users, but agents fundamentally change the game.

Vineet Bhatkoti — DZone

AI agents are quietly generating chaos engineering failures enterprises don’t track yet

This is an interesting lens for exploring the risks that agents can introduce.

Sayali Patil — VentureBeat

Reddit r/sre: How long does your company give new people before they put them oncall

Great discussion in the comments! There’s a lot of variance in how much time people recommend. I personally tend to lean earlier — on-call is a great way to learn, and I can always reach out if I get stuck.

u/modern_medicine_isnt and commenters — Reddit r/sre

Metastable Failures Explained: Why Fixing the Trigger Fails

A great intro to the concept of metastable failures, and I recommend reading the original paper as well.

Teiva Harsanyi

Most Companies Wait Too Long to Declare Incidents

The real issue is that your company has made declaring an incident costly and risky for the person who does it.

Brent Chapman

A postmortem of our May 7, 2026 outage

I enjoyed learning about their deliberate architectural choice to keep their central service in a single AZ. This incident highlighted a need for a fast failover plan.

Coinbase

Customers over control: how we measure On-call reliability

I like the balance between ensuring 99.99% reliability and designing their product to encourage customers to use their platform in a way that effectively manages the 0.01% case.

Reliability is a customer experience problem

Mike Fisher — incident.io

The demon of the gaps

I’m not gonna spoil this one for you by writing a summary. Just read it, trust me.

Lorin Hochstein
