SRE Weekly Issue #519

A message from our sponsor, BigPanda:

What if you could predict which changes will cause incidents?

BigPanda analyzes every change, including ones marked safe, to surface the real risk and impact before deployment. Next time, routine changes don’t become your next P1.

See BigPanda for SREs

They give solid examples to argue that much of the learning happens during the process of writing a post-incident review.

[…] you could throw the post-incident review document away after writing it and still get the vast majority of the value out of the process.

  Brent Chapman

I really like this idea of change absorption capacity.

  Priya Gopalsamy — Stack Overflow

A useful guide that covers strategies for benchmarking, along with pitfalls to avoid.

  Ben Dicken — PlanetScale

Serverless isn’t inherently cheaper. Hidden costs add up, and at scale it’s often pricier than containers — best for sporadic, not steady workloads.

  David Iyanu Jonathan — DZone

With just under 4.5 minutes of leeway for outages per month, you have to rely on automated remediation. AI can help, but it’s not a full solution, per this article.

  Norberto Lopes — incident.io
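
For anyone who wants to check the arithmetic behind that number, here is a minimal sketch. It assumes the figure comes from a 99.99% ("four nines") availability target measured over a 30-day month; neither assumption is stated in the blurb above, and `downtime_budget_minutes` is a hypothetical helper name used only for illustration.

```python
def downtime_budget_minutes(availability: float, days: float = 30.0) -> float:
    """Return the minutes of downtime allowed over `days` at a given availability.

    `availability` is a fraction, e.g. 0.9999 for "four nines" (an assumed
    target; the blurb does not name the SLO). `days` defaults to a 30-day month.
    """
    total_minutes = days * 24 * 60  # minutes in the measurement window
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # Compare a few common availability targets over a 30-day window.
    for target in (0.999, 0.9999, 0.99999):
        budget = downtime_budget_minutes(target)
        print(f"{target:.3%} availability: {budget:.2f} minutes per 30 days")
```

Under those assumptions, the four-nines budget comes to about 4.32 minutes per 30-day month, which lines up with "just under 4.5 minutes".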

LLMs are specifically designed to generate plausible-seeming output, and this makes reviewing especially difficult.

  Diomidis Spinellis

A breakdown of the 28-hour AWS us-east-1 outage in May 2026: what caused it, what went down, and what it means for how you design your infrastructure.

  Alon Shrestha

This article has a list of common problems in incident response, and I feel like printing it and taping it to my wall.

  Karan Nagarajagowda — Uptime Labs

Updated: May 31, 2026 — 10:22 pm
A production of Tinker Tinker Tinker, LLC