SRE Weekly Issue #474

A message from our sponsor, incident.io:

We’ve just raised $62M at incident.io to build AI agents that resolve incidents with you. See how we’re pioneering a new era of incident management.

https://go.incident.io/blog/incident.io-raises-62m

This is a truly outstanding article about blameless incident analysis! Beyond just “why”, it covers many of the pitfalls that trip people up when they try to enact a blameless culture, including questions about accountability.

  fgj

Here’s a good reminder that resilience in our systems is all about the humans.

  Stuart Rimell

This article outlines WarpStream’s solution to a common problem in systems based on shared storage (like S3): cleaning up objects that are no longer needed, at scale.

  Richard Artoul — WarpStream

I love learning how companies structure their on-call rota. My favorite part of this one is the emphasis on keeping the manager in the rota as a feedback mechanism.

  Laura de Vesine and David Lentz — Datadog

These folks continuously detect drift by running terraform plan and alerting on changes that have no corresponding commit in git.

  Yugandhar Suthari
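
For a rough idea of what the `terraform plan` half of that loop might look like, here is a minimal sketch. It is not the article's implementation. It relies on the documented `-detailed-exitcode` flag, where exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes. The function name `plan_has_changes` and the injectable `run` parameter are hypothetical, and matching a detected change against git history is left out entirely.

```python
import subprocess
from typing import Callable

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_NO_CHANGES = 0
PLAN_ERROR = 1
PLAN_CHANGES_PRESENT = 2


def plan_has_changes(
    workdir: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if `terraform plan` reports pending changes in workdir.

    Hypothetical helper: `run` is injectable so the logic can be exercised
    without a real Terraform installation.
    """
    cmd = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]
    result = run(cmd, cwd=workdir, capture_output=True, text=True)

    if result.returncode == PLAN_NO_CHANGES:
        return False
    if result.returncode == PLAN_CHANGES_PRESENT:
        # Live infrastructure and configuration disagree: possible drift.
        return True
    # Anything else (including PLAN_ERROR) means the plan itself failed.
    raise RuntimeError(f"terraform plan failed: {result.stderr.strip()}")
```

A scheduled job could call `plan_has_changes` for each workspace and raise an alert on `True`. Deciding whether a change is expected, for example by checking for a recent commit, would be a separate step that this sketch does not attempt.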

This troubleshooting story has nothing to do with tech, but the technique it uses can easily apply to your next incident.

  Paige Cruz

Some causes of Terraform drift that you may not have thought of, along with an exploration of the problems drift can bring.

  Saijal Shrivastava — Razorpay

Railway had an outage this week related to their control plane database, and they shared this write-up.

  Ray Chen — Railway

Updated: April 27, 2025 — 9:55 pm
A production of Tinker Tinker Tinker, LLC