They give solid examples to argue that much of the learning happens during the process of writing a post-incident review.
[…] you could throw the post-incident review document away after writing it and still get the vast majority of the value out of the process.
Brent Chapman
I really like this idea of change absorption capacity.
Priya Gopalsamy — Stack Overflow
A useful guide that covers strategies for benchmarking, along with pitfalls to avoid.
Ben Dicken — PlanetScale
Serverless isn’t inherently cheaper. Hidden costs add up, and at scale it’s often pricier than containers — best for sporadic, not steady workloads.
David Iyanu Jonathan — DZone
With just under 4.5 minutes of leeway for outages per month, you have to rely on automated remediation. AI can help, but it’s not a full solution, per this article.
Norberto Lopes — incident.io
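For a sense of where a figure like "just under 4.5 minutes" can come from, here is a minimal sketch of the downtime-budget arithmetic. It assumes a 99.99% ("four nines") monthly availability target, which the blurb above does not state explicitly; the function name and the 30-day default are illustrative choices of mine, not taken from the article.

```python
def monthly_downtime_budget_minutes(availability: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime allowed in one month at a given availability target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability)


# Assumed target: 99.99% availability (not stated in the blurb above).
budget_30 = monthly_downtime_budget_minutes(0.9999)
budget_31 = monthly_downtime_budget_minutes(0.9999, days_in_month=31)

print(f"30-day month: {budget_30:.2f} minutes")
print(f"31-day month: {budget_31:.2f} minutes")
```

Under that assumption the budget is roughly 4.3 minutes in a 30-day month and roughly 4.5 minutes in a 31-day month, which leaves very little room for a human to be paged, log in, and diagnose anything.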
LLMs are specifically designed to generate plausible-seeming output, and this makes reviewing especially difficult.
Diomidis Spinellis
A breakdown of the 28-hour AWS us-east-1 outage in May 2026: what caused it, what went down, and what it means for how you design your infrastructure.
Alon Shrestha
This article has a list of common problems in incident response, and I feel like printing it and taping it to my wall.
Karan Nagarajagowda — Uptime Labs
