SRE Weekly Issue #353

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?


This article gives two reasons why site reliability engineers may be part of IT teams for years to come, and two reasons why site reliability engineering may turn out to be just a fad.

  Christopher Tozzi — ITPro Today

This article proposes an interesting method for incident investigation: constantly try to disprove your hypotheses to avoid confirmation bias.

  Ivan Merill — Fiberplane

How I’ve managed to run this newsletter for almost 7 years without a single mention of the Therac-25 incidents is beyond me. Therac-25 is an important lesson for all of us as we design systems and analyze incidents.

  Adam Fabio — Hackaday

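Even though this happened 14 years ago, the cause is very much still relevant today. If two bit-flips in the same TCP packet happen to cancel each other out (the same bit position in two different 16-bit words, flipped in opposite directions), the packet will still pass the checksum.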

  Poppy Linden — Linden Lab
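
To make the checksum point concrete, here is a minimal sketch of the 16-bit one's-complement Internet checksum (the RFC 1071 algorithm that TCP uses). It is an illustration only: the sample bytes and helper names are made up for this example, and a real TCP checksum also covers the header and a pseudo-header, which are left out here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical helper)."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


# Hypothetical payload: three 16-bit words 0x1234, 0x5678, 0x9ABC.
original = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC])

# One flip: bit 2 of byte 1 goes 1 -> 0, so the sum changes and is detected.
single = flip_bit(original, 1, 2)

# Two cancelling flips: bit 2 of byte 1 goes 1 -> 0 (minus 4) and
# bit 2 of byte 3 goes 0 -> 1 (plus 4). Both are low bytes of their words,
# so the word sum, and therefore the checksum, is unchanged.
cancelling = flip_bit(single, 3, 2)

# Two flips in the same direction: bit 2 of byte 1 and bit 2 of byte 5
# both go 1 -> 0, so the sum drops by 8 and the corruption is detected.
same_direction = flip_bit(single, 5, 2)

print(internet_checksum(original) == internet_checksum(single))          # single flip
print(internet_checksum(original) == internet_checksum(cancelling))      # cancelling pair
print(internet_checksum(original) == internet_checksum(same_direction))  # same-direction pair
```

Only the cancelling pair slips through, which is why checks such as a CRC or a cryptographic hash are often layered on top of TCP when payload integrity matters.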

This article proposes two criteria: Actionability and Investigability.

  Dan Slimmon

This write-up chronicles an incident in which a poison pill message repeatedly crashed the authors' Heroku app.

  Lawrence Jones —

Take this one with a grain of salt, since there’s a fair bit of counterfactual reasoning in the description. Nevertheless, there’s a lot to learn from this and from Wikipedia’s article on the same accident.

  Admiral Cloudberg

Updated: December 26, 2022 — 4:01 pm
A production of Tinker Tinker Tinker, LLC