This is a presentation by Laura Nolan (with text transcript) all about cascading failure, what causes it, how to avoid it, and how to deal with it when it happens.
I love how succinct this is:
[…] in any system where we design to fail over, so any mechanism at all that redistributes load from a failed component to still working components, we create the potential for a cascading failure to happen.
Laura Nolan — Slack (presented at InfoQ)
It’s so easy to explain an incident by describing how management could have prevented it by investing additional resources.
Lorin goes on to explain the “trap” part: it’s easy to stop investigating an incident too soon and declare the cause to be “greedy executives”, which prevents us from learning more.
Reddit redesigned one of their caching systems in 2020, and it paid off handsomely during the GameStop saga. This article discusses the redesign and considers what would have happened without it.
Garrett Hoffman — Reddit
The lessons are:
- Do retrospectives for small incidents first.
- Do a retrospective soon after the incident.
- Alert on the user experience.
All great advice, and #1 is an interesting idea I hadn’t heard before.
Robert Ross — FireHydrant
We can’t engineer reliability in a vacuum. This is a great explainer on how SRE siloing happens, the problems it causes, and how to break SRE out of its shell.
JJ Tang — Rootly
This ASRS (Aviation Safety Reporting System) Callback issue has some real-world examples of resilient systems in action.
Facing a common set of Kubernetes node failure modes, Cloudflare uses open source tools (including one they published) to perform automatic restarts.
In the past 30 days, we’ve used the above automatic node remediation process to action 571 nodes. That has saved our humans a considerable amount of time.
Andrew DeMaria — Cloudflare