Articles
Using an NTSB report on an airplane crash as a case study, this video presents three common traps we fall into in incident retrospectives:
- Counterfactual reasoning
- Normative language
- Mechanistic reasoning
I want to make this required viewing for all retrospective participants.
Dr. Johan Bergström — Lund University
Peak-shifting can save you and your customers money and make load easier to handle.
Lara PuReum Yim, Prashant Kumar, Raghav Garg, Preeti Kotamarthi, Ajmal Afif, Calvin Ng Tjioe, and Renrong Weng — Grab
These folks structured their on-call and incident response process around wombats (and sound guidelines and playbooks).
Wes Mason — npm
Lots of great stuff in this case study on an incident involving Chef and Apache. My favorite:
"Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals"
Ryn Daniels — HashiCorp
Here’s how and why Grab rebuilt their logging pipeline around structured JSON events.
Aditya Praharaj — Grab
Don Miguel Ruiz’s Four Agreements as applied to incident response:
- Be Impeccable With Your Word
- Don’t Take Anything Personally
- Don’t Make Assumptions
- Always Do Your Best
Matt Stratton — PagerDuty
Outages
- Duo Security
  - Retrospective analysis included.
- Google Cloud Platform (Cloud Routers)
- Fastly
- Wells Fargo
- Hosted Graphite