This talk summary explores the idea that “error” is a label applied to an event from the outside, rather than a simple fact. What can this tell us about our post-incident investigation process?
Here’s a deep dive into a performance degradation at Cloudflare last December that was related to missing error handling in a shell script.
Alex Forster — Cloudflare
Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.
Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).
Gergely Orosz — Pragmatic Engineer
Cool trick: their client library can fall back to a backup domain if DNS for ably.io fails.
Jo Stichbury — Ably
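As a rough illustration of the fallback idea (not Ably's actual implementation), a client could try a list of hostnames in order and use the first one that resolves. The function name `first_resolvable_host` and the `example.com` / `example.net` hostnames are hypothetical, and the injectable `resolve` parameter is only there to make the sketch easy to test.

```python
import socket
from typing import Callable, Sequence


def first_resolvable_host(
    hosts: Sequence[str],
    port: int = 443,
    resolve: Callable[..., object] = socket.getaddrinfo,
) -> str:
    """Return the first host in `hosts` whose DNS lookup succeeds."""
    last_error = None
    for host in hosts:
        try:
            # Only the lookup is attempted here; no connection is opened.
            resolve(host, port)
            return host
        except socket.gaierror as exc:
            # DNS failed for this host, so move on to the next candidate.
            last_error = exc
    raise ConnectionError(f"no host could be resolved: {list(hosts)}") from last_error


# Hypothetical usage: the primary domain first, then a backup on a different domain.
# host = first_resolvable_host(["primary.example.com", "fallback.example.net"])
```

Putting the backup on a separate domain matters in a sketch like this: a fallback under the same domain would fail along with the primary if that domain's DNS were the problem.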
It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.
Laurent Bernaille and David Lentz — Datadog
Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.
Zia Mian and M. V. Ramana — Scientific American
Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.
Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.