This is a truly outstanding article about blameless incident analysis! Beyond just “why”, it covers many of the pitfalls that trip people up when they try to enact a blameless culture, including questions about accountability.
fgj
Here’s a good reminder that resilience in our systems is all about the humans.
Stuart Rimell
This article outlines WarpStream’s solution to a common problem in systems based on shared storage (like S3): cleaning up objects that are no longer needed, at scale.
Richard Artoul — WarpStream
I love learning how companies structure their on-call rota. My favorite part of this one is the emphasis on keeping the manager in the rota as a feedback mechanism.
Laura de Vesine and David Lentz — Datadog
These folks continuously detect drift by running terraform plan and alerting on changes that have no corresponding commit in git.
Yugandhar Suthari
It’s a troubleshooting story having nothing to do with tech, but the technique used can easily apply to your next incident.
Paige Cruz
Some examples you may not have thought of that can lead to Terraform drift, along with an exploration of the problems drift can bring.
Saijal Shrivastava — Razorpay
Railway had an outage this week related to their control plane database, and they shared this write-up.
Ray Chen — Railway