SRE Weekly Issue #353

This article contains two reasons why site reliability engineers may be part of IT teams for years to come, and two reasons why site reliability engineering may turn out to be just a fad.

  Christopher Tozzi — ITPro Today

This article proposes an interesting method for incident investigation: constantly try to disprove your hypotheses to avoid confirmation bias.

  Ivan Merill — Fiberplane

How I’ve managed to run this newsletter for almost 7 years without a single mention of the Therac-25 incidents is beyond me. Therac-25 is an important lesson for all of us as we design systems and analyze incidents.

  Adam Fabio — Hackaday

Even though this happened 14 years ago, the cause is still very much relevant today. If two bit-flips in the same TCP packet happen to cancel each other out, the packet will still pass the checksum.

  Poppy Linden — Linden Lab
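
To see why a pair of bit-flips can slip past the TCP checksum, here is a minimal Python sketch of the 16-bit ones' complement sum that TCP uses. It is not taken from the linked article: the sample bytes and the helper names (internet_checksum, flip_bit) are made up for illustration, and a real TCP checksum also covers a pseudo-header that is left out here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over the data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical helper)."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


# Hypothetical payload: two 16-bit words, 0x1234 and 0x5678.
original = bytes([0x12, 0x34, 0x56, 0x78])

# One flip: bit 2 of the first word goes 1 -> 0, so the sum drops by 4.
one_flip = flip_bit(original, 1, 2)

# Second flip: the same bit position in the second word goes 0 -> 1,
# adding the 4 back, so the sum (and the checksum) is unchanged.
two_flips = flip_bit(one_flip, 3, 2)

print(internet_checksum(one_flip) != internet_checksum(original))
print(internet_checksum(two_flips) == internet_checksum(original))
```

The pair goes undetected only because the flips land on the same bit position of two different 16-bit words and run in opposite directions. Most other pairs of flips would change the sum and be caught.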

This article proposes two criteria: Actionability and Investigability.

  Dan Slimmon

This write-up chronicles an incident in which a poison pill message repeatedly crashed the authors' Heroku app.

  Lawrence Jones —

Take this one with a grain of salt, since there’s a fair bit of counterfactual reasoning in the description. Nevertheless, there’s a lot to learn from this and from Wikipedia’s article on the same accident.

  Admiral Cloudberg

