SRE Weekly Issue #163

Articles

Three analytical traps in accident investigation (YouTube, 7:36)

Using an NTSB report on an airplane crash as a case study, this video presents three common traps we fall into in incident retrospectives:

Counterfactual reasoning

Normative language

Mechanistic reasoning

I want to make this required material for all retrospective participants.

Dr. Johan Bergström — Lund University

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Peak-shifting, nudging demand away from the busiest times, can save you and your customers money and make load easier to handle.

Lara PuReum Yim, Prashant Kumar, Raghav Garg, Preeti Kotamarthi, Ajmal Afif, Calvin Ng Tjioe, and Renrong Weng — Grab

npm On-Call

These folks structured their on-call and incident response process around wombats (and sound guidelines and playbooks).

Wes Mason — npm

Crafting a Resilient Culture: Or, How to Survive an Accidental Mid-Day Production Incident

Lots of great stuff in this case study on an incident involving Chef and Apache. My favorite:

Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals

Ryn Daniels — HashiCorp

Structured Logging: The Best Friend You’ll Want When Things Go Wrong

Here’s how and why Grab rebuilt their logging pipeline around structured JSON events.

Aditya Praharaj — Grab
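
For readers new to the idea, here is a minimal sketch of what "structured JSON events" can look like, using only Python's standard library. This is not Grab's pipeline: the logger name `booking`, the `fields` convention for attaching key/value context, and the example field names are all assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}
        # to attach machine-readable context to the event.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(stream, name="booking"):
    """Return a logger that writes JSON events to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep events out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo():
    """Log one event to an in-memory buffer and parse it back."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("ride matched",
             extra={"fields": {"booking_id": "b-123", "latency_ms": 42}})
    return json.loads(buffer.getvalue())
```

Because every event is a JSON object with named fields, you can filter or aggregate on something like `booking_id` during an incident instead of writing regular expressions against free-form text.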

The Four Agreements of Incident Response

Don Miguel Ruiz’s Four Agreements as applied to incident response:

Be Impeccable With Your Word

Don’t Take Anything Personally

Don’t Make Assumptions

Always Do Your Best

Matt Stratton — PagerDuty

Outages

Duo Security
- Retrospective analysis included.
Google Cloud Platform (Cloud Routers)
Fastly
Wells Fargo
Hosted Graphite
