SRE Weekly Issue #476

The myth is:

The underlying and often unexamined assumption for the benefits of automation is the notion that computers/machines are better at some tasks, and humans are better at a different, non-overlapping set of tasks.

This article lays out several pitfalls of this approach, with references.

  Courtney Nash

Wow, I seriously love this one. It’s written in a very approachable style that’s easy to understand from the outside. It lays out a series of cringe-worthy contributing factors that could happen to any of us, making this a great learning opportunity.

  Spotify

This is the first time I’ve come across the term “grounding” in incident response, and I like it!

At the core of our vision lies the principle of grounding, drawn from safety-critical systems like aviation and the fire service industries. Grounding is the process of maintaining a shared understanding among team members throughout the course of an incident.

  Uptime Labs

I really like the idea of using formal modeling on distributed systems. Datadog explains how they did it when building a new message queuing service.

  Arun Parthiban, Sesh Nalla, and Cecilia Wat-Kim

I found this to be a really useful primer on the new EU AI regulation. It does transition into a sales pitch toward the end, but the pre-pitch content is substantial.

  Chris Evans — incident.io

A classic example of Lorin’s Law: work intended to improve reliability was at the heart of this incident.

  Railway

Feature flags are incredibly useful, but they have some gotchas too.

  Tom Elliott

More potential problems to watch out for with feature flags, but this one ends by emphasizing that feature flags are still an important tool. Bonus points for a Knight Capital Incident mention.

  Ian Vanagas

Updated: May 11, 2025 — 9:16 pm
A production of Tinker Tinker Tinker, LLC