An incident write-up from the archives, and it’s a juicy one. An update to their code caused a crash only after some time had passed, so their automated testing didn’t catch it before they deployed it worldwide.
Xandr
This article covers an independent review of the Optus outage.
I personally find it astounding that somebody conducting an incident investigation would not delve deeper into how a decision that appears to be astounding would have made sense in the moment.
Lorin Hochstein
Cloudflare needed a tool to detect overlapping impact across their many maintenance events, so that they could avoid unintentionally impairing redundancy.
Kevin Deems and Michael Hoffmann — Cloudflare
Another great piece on expiration dates. I especially like the discussion of abrupt cliffs as a design choice.
Chris Siebenmann — University of Toronto
It’s not always easy to see how to automate a given bit of toil, especially when cross-team interactions are involved.
Thomas A. Limoncelli and Christian Pearce — ACM Queue
How do resilience and fault tolerance relate? Are they synonyms, do they overlap, or does one contain the other?
Uwe Friedrichsen
After unexpectedly losing their observability vendor, these folks were able to migrate to a new solution within a couple of days.
Karan Abrol, Yating Zhou, Pratyush Verma, Aditya Bhandari, and Sameer Agarwal — Deductive.ai
A great dive into what blameless incident analysis really means.
Blameless also doesn’t mean you stop talking about what people did.
Busra Koken
