SRE Weekly Issue #163

A message from our sponsor, VictorOps:

Being on-call sucks. To make it better, sign up for the free webinar, “How to Make On-Call Suck Less”, to learn 5 simple steps you can take to improve the on-call experience and become a more efficient SRE team:

http://try.victorops.com/sreweekly/how-to-make-on-call-suck-less

Articles

Using an NTSB report on an airplane crash as a case study, this video presents three common traps we fall into in incident retrospectives:

  1. Counterfactual reasoning
  2. Normative language
  3. Mechanistic reasoning

I want to make this required material for all retrospective participants.

Dr. Johan Bergström — Lund University

Peak-shifting can save you and your customers money and make load easier to handle.

Lara PuReum Yim, Prashant Kumar, Raghav Garg, Preeti Kotamarthi, Ajmal Afif, Calvin Ng Tjioe, and Renrong Weng — Grab

These folks structured their on-call and incident response process around wombats (and sound guidelines and playbooks).

Wes Mason — npm

Lots of great stuff in this case study on an incident involving Chef and Apache. My favorite:

Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals

Ryn Daniels — HashiCorp

Here’s how and why Grab rebuilt their logging pipeline around structured JSON events.

Aditya Praharaj — Grab

Don Miguel Ruiz’s Four Agreements as applied to incident response:

  1. Be Impeccable With Your Word
  2. Don’t Take Anything Personally
  3. Don’t Make Assumptions
  4. Always Do Your Best

Matt Stratton — PagerDuty

Outages

Updated: March 10, 2019 — 12:59 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme