This article contains two reasons why site reliability engineers may be part of IT teams for years to come, and two reasons why site reliability engineering may turn out to be just a fad.
Christopher Tozzi — ITPro Today
This article proposes an interesting method for incident investigation: constantly try to disprove your hypotheses to avoid confirmation bias.
Ivan Merill — Fiberplane
How I’ve managed to run this newsletter for almost 7 years without a single mention of the Therac-25 incidents is beyond me. Therac-25 is an important lesson for all of us as we design systems and analyze incidents.
Adam Fabio — Hackaday
Even though this happened 14 years ago, the cause is still very much relevant today. Two bit-flips in the same TCP packet can cancel each other out, and the corrupted packet will still pass the checksum.
Poppy Linden — Linden Lab
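To see why a pair of bit-flips can slip through, here is a minimal sketch of the 16-bit ones' complement checksum described in RFC 1071, which TCP uses. The sample bytes and the `flip_bit` helper are made up for illustration, and a real TCP checksum also covers a pseudo-header that this sketch leaves out. Because the checksum is just a sum of 16-bit words, flipping the same bit position in two different words in opposite directions (one 0 to 1, the other 1 to 0) leaves the sum unchanged.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        # fold any carry back into the low 16 bits (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Hypothetical helper: return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


# Illustrative payload: two 16-bit words, 0x1234 and 0x5679.
original = bytes([0x12, 0x34, 0x56, 0x79])

# One flip: the lowest bit of the first word goes 0 -> 1.
single_flip = flip_bit(original, 1, 0)

# Second flip: the lowest bit of the second word goes 1 -> 0,
# which cancels the first flip in the sum.
double_flip = flip_bit(single_flip, 3, 0)

print(hex(internet_checksum(original)))
print(hex(internet_checksum(single_flip)))   # differs: one flip is caught
print(hex(internet_checksum(double_flip)))   # matches the original
```

A single flip changes the checksum and is detected, while the compensating pair is not. This is one reason to layer a stronger integrity check, such as a CRC or a cryptographic hash, on top of TCP for data that matters.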
This article proposes two criteria: Actionability and Investigability.
This write-up chronicles an incident in which a poison pill message repeatedly crashed their Heroku app.
Lawrence Jones — incident.io
Take this one with a grain of salt, since there’s a fair bit of counterfactual reasoning in the description. Nevertheless, there’s a lot to learn from this and from Wikipedia’s article on the same accident.