SRE Weekly Issue #474

Why do we do blameless incident reviews?

This is a truly outstanding article about blameless incident analysis! Beyond just “why”, it covers many of the pitfalls that trip people up when they try to enact a blameless culture, including questions about accountability.

fgj

Tech without us: Why there wasn’t an outage today

Here’s a good reminder that resilience in our systems is all about the humans.

Stuart Rimell

Taking out the Trash: Garbage Collection of Object Storage at Massive Scale

This article outlines WarpStream’s solution to a common problem in systems based on shared storage (like S3): cleaning up objects that are no longer needed, at scale.

Richard Artoul — WarpStream

How we structure on call rotations at Datadog

I love learning how companies structure their on-call rota. My favorite part of this one is the emphasis on keeping the manager in the rota as a feedback mechanism.

Laura de Vesine and David Lentz — Datadog

Terraform Drift Detection: How to Catch Configuration Drift

These folks continuously detect drift by running terraform plan and alerting on changes that have no corresponding commit in git.

Yugandhar Suthari
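
The idea in the blurb above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: it relies on `terraform plan -detailed-exitcode` (exit code 0 means no changes, 1 means an error, 2 means changes are present), and it assumes you record the last commit you applied somewhere. The names `is_drift`, `check_drift`, and `last_applied_commit` are hypothetical.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_NO_CHANGES = 0
PLAN_HAS_CHANGES = 2


def is_drift(plan_exit_code: int, head_commit: str, last_applied_commit: str) -> bool:
    """Return True when the plan shows changes that no new commit explains."""
    if plan_exit_code not in (PLAN_NO_CHANGES, PLAN_HAS_CHANGES):
        raise RuntimeError("terraform plan failed; cannot decide about drift")
    if plan_exit_code == PLAN_NO_CHANGES:
        return False
    # The plan shows changes. If HEAD has already been applied, the code did
    # not cause them, so something changed outside of git (assumption: the
    # caller tracks last_applied_commit, e.g. in its CI system).
    return head_commit == last_applied_commit


def check_drift(workdir: str, last_applied_commit: str) -> bool:
    """Run terraform and git in workdir and report whether drift is present."""
    plan = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=workdir, capture_output=True, text=True, check=True,
    ).stdout.strip()
    return is_drift(plan.returncode, head, last_applied_commit)
```

Run on a schedule, a `True` result from `check_drift` would be the signal to alert. Keeping the decision in `is_drift` separate from the subprocess calls makes the rule easy to test without Terraform installed.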

On Describing Not Explaining

This troubleshooting story has nothing to do with tech, but the technique it uses can easily apply to your next incident.

Paige Cruz

The Dark Side of Terraform: Drifts, Chaos, and the Headaches They Bring

Some examples you may not have thought of that can lead to Terraform drift, along with an exploration of the problems drift can bring.

Saijal Shrivastava — Razorpay

Incident Report: April 23rd, 2025

Railway had an outage this week related to their control plane database, and they shared this write-up.

Ray Chen — Railway
