SRE Weekly Issue #520

A message from our sponsor, BigPanda:

Your team solved this incident last month. Why is it back?

Because you fixed the symptom, not the cause. BigPanda surfaces the pattern behind repeat incidents and tells you what to fix so the next on-call doesn’t fight the same P1.

Prevent incidents proactively

We build our systems against the usage patterns of human users, but agents fundamentally change the game.

   Vineet Bhatkoti — DZone

This is an interesting lens for exploring the risks that agents can introduce.

  Sayali Patil — VentureBeat

Great discussion in the comments! There’s a lot of variance in how much time people recommend. I personally tend to lean earlier — on-call is a great way to learn, and I can always reach out if I get stuck.

  u/modern_medicine_isnt and commenters — Reddit r/sre

A great intro to the concept of metastable failures, and I recommend reading the original paper as well.

  Teiva Harsanyi

The real issue is that your company has made declaring an incident costly and risky for the person who does it.

  Brent Chapman

I enjoyed learning about their deliberate architectural choice to keep their central service in a single AZ. This incident highlighted a need for a fast failover plan.

  Coinbase

I like the balance between ensuring 99.99% reliability and designing their product to encourage customers to use their platform in a way that effectively manages the 0.01% case.

Reliability is a customer experience problem

  Mike Fisher — incident.io

I’m not gonna spoil this one for you by writing a summary. Just read it, trust me.

  Lorin Hochstein

Updated: June 7, 2026 — 9:21 pm
A production of Tinker Tinker Tinker, LLC