General

SRE Weekly Issue #164

A message from our sponsor, VictorOps:

Start making on-call suck less. Last chance to register for the free VictorOps webinar where you can learn about using automation and improved collaboration to create a better on-call experience:

http://try.victorops.com/sreweekly/how-to-make-on-call-suck-less

Articles

I previously shared an article about the 737 MAX 8, and I’m truly saddened that another accident has occurred. Learning from accidents like this is incredibly important, and the NTSB is among the best at it. I look forward to seeing what we can take away from this to make air travel even safer.

Farnoush Amiri and Ben Kesslen — NBC

The existence of this anonymous channel for pilots is really interesting to me. It sounds like a great way to learn about near misses, which can be nearly identical to catastrophic accidents. Can we implement this kind of anonymous channel in our organizations too?

Thom Patterson and Aaron Cooper — CNN

“Aviation accidents are rarely the result of a single cause,” Lewis noted. “There are often many small things that lead to a crash, and that’s why these investigations take so long.”

Francesca Paris — NPR

Google and other companies are working on their own private undersea cables.

‘People think that data is in the cloud, but it’s not. It’s in the ocean.’

Adam Satariano — New York Times

For this week, I have an article about on-call and how it’s done at NASA. Many of the conclusions here may not be that surprising to those who have been on-call for any length of time, but I think there is a lot to learn from how NASA makes the system work.

Thai Wood — Resilience Roundup (summary)

Emily S Patterson and David D Woods — Ohio State University (original article)

I hadn’t thought of this before, but I really like this idea:

The facilitator’s role in the meeting is different from the other participants. They do not voice their own ideas, but keep the discussion on track and encourage the group to speak up.

Rachael Byrne — PagerDuty

Outages

SRE Weekly Issue #163

A message from our sponsor, VictorOps:

Being on-call sucks. To make it better, sign up for the free webinar, “How to Make On-Call Suck Less”, to learn 5 simple steps you can take to improve the on-call experience and become a more efficient SRE team:

http://try.victorops.com/sreweekly/how-to-make-on-call-suck-less

Articles

Using an NTSB report on an airplane crash as a case study, this video presents three common traps we fall into in incident retrospectives:

  1. Counterfactual reasoning
  2. Normative language
  3. Mechanistic reasoning

I want to make this required material for all retrospective participants.

Dr. Johan Bergström — Lund University

Peak-shifting can save you and your customers money and make load easier to handle.

Lara PuReum Yim, Prashant Kumar, Raghav Garg, Preeti Kotamarthi, Ajmal Afif, Calvin Ng Tjioe, and Renrong Weng — Grab

These folks structured their on-call and incident response process around wombats (and sound guidelines and playbooks).

Wes Mason — npm

Lots of great stuff in this case study on an incident involving Chef and Apache. My favorite:

Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals

Ryn Daniels — HashiCorp

Here’s how and why Grab rebuilt their logging pipeline around structured JSON events.

Aditya Praharaj — Grab
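
If “structured JSON events” sounds abstract: the idea is to emit one machine-parseable object per event instead of a free-text line, so the pipeline can filter and aggregate on fields rather than grepping. Here’s a minimal Python sketch, with made-up field names rather than Grab’s actual schema:

    import json, sys, time

    # Instead of a free-text line like "booking 12345 failed after 3 retries",
    # emit one self-describing JSON object per event.
    def log_event(event, **fields):
        print(json.dumps({"ts": time.time(), "event": event, **fields}), file=sys.stdout)

    log_event("booking_failed", booking_id=12345, retries=3, service="demo-api")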

Don Miguel Ruiz’s Four Agreements as applied to incident response:

  1. Be Impeccable With Your Word
  2. Don’t Take Anything Personally
  3. Don’t Make Assumptions
  4. Always Do Your Best

Matt Stratton — PagerDuty

Outages

SRE Weekly Issue #162

A message from our sponsor, VictorOps:

Ever been on-call? Then you know it can suck. Check out some of our tips and tricks to see how SRE teams are maintaining composure during a critical incident and making on-call suck less:

http://try.victorops.com/sreweekly/on-call-stress-management

Articles

Want to nerd out on BGP? Check out how this person modeled the Eve Online universe as an 8000-VM cluster running BGP.

Ben Cartwright-Cox

Accrued vacation time is antiquated, and “unlimited” vacation paradoxically leads employees to take less time overall. Time to enforce vacations, lest we forget that burnout is a reliability risk.

Baron Schwartz

How to avoid catastrophe: pay attention to near misses. This article makes an incredibly compelling point that we need to make a conscious effort to pay attention to near misses, and explains how cognitive bias will tend to make us do the exact opposite.

Catherine H. Tinsley, Robin L. Dillon, and Peter M. Madsen — Harvard Business Review

An intro to how blame causes problems, why blamelessness is better, and how to adopt a blameless culture.

Ashar Rizqi

A 100-year-old chemical company thought they had a great safety record. Turns out that folks were just considering accidents “routine” and not reporting them.

Thai Wood (reviewing a paper by Stefanie Huber, Ivette van Wijgerden, Arjan de Witt, and Sidney W.A. Dekker)

Booking.com has 50,000 servers and many SRE squads. They developed tools they call the Reliability Collaboration Model and the Ownership Map to help them define which products SRE squads support and at what level.

Emmanuel Goossaert — Booking.com

Outages

  • New Relic
  • Duo Security
  • Amtrak (US long-distance passenger rail)
    • Amtrak had an outage of its switching system this past week. Linked above is an article with the inflammatory title, “Human error? Try abject stupidity, Amtrak”. Exercise: try to think of ways in which this is not a case of abject stupidity.

      Rich Miller — Capitol Fax

  • YouTube

SRE Weekly Issue #161

A message from our sponsor, VictorOps:

Being on-call can suck. Without alert context or a collaborative system for incident response, SRE teams will have trouble appropriately responding to on-call incidents. Check out The On-Call Template to become the master of on-call and improve service reliability:

http://try.victorops.com/sreweekly/the-on-call-template

Articles

I’m not a fan of error budgets. I’ve never seen them implemented particularly well up close, though I know lots of folks who say they work for them.

I’ve started to feel a bit sour on the whole error budget thing, but I couldn’t really pin down why. This article really nails it.

Will Gallego

Will Gallego is my co-worker, although I came across this article separately.

I’m still hooked on flight accident case studies. In this one, mission fixation and indecision lead to disaster.

Air Safety Institute

If I were setting up a curriculum at a university, I’d make an entire semester-long class on the Challenger disaster, and make it required for any remotely STEM-oriented major.

This awesome article is about getting so used to pushing the limits that you forget you’re even doing it, until disaster strikes.

Foone Turing

A couple weeks back, I linked to a survey about compensation for on-call. Here’s an analysis of the results and some raw data in case you want to tinker with it.

Chris Evans and Spike Lindsey

Learn how this company does incident management drills. They seem to handle things much like a real incident, including doing a retrospective afterward!

Tim Little — Kudos

Outages

SRE Weekly Issue #160

A message from our sponsor, VictorOps:

Establishing an effective post-incident review process and taking the time to execute on it makes a world of difference in software reliability. See this example of a post-incident review process that’s already helping SRE teams continuously improve:

http://try.victorops.com/sreweekly/post-incident-review-process

Articles

This is a long one, but trust me, it’s worth the read. My favorite part is where the author gets into mental models, hearkening back to the Stella Report.

Fred Hebert

When CDN outages occur, it becomes immediately clear who is using multiple CDNs and who is not.

A multi-CDN approach can be tricky to pull off, but as these folks explain, it can be critical for reliability and performance.

Scott Kidder — Mux

Full disclosure: Fastly, my employer, is mentioned.

This article explains five different phenomena that people mean when they say “technical debt”, and advocates understanding the full context rather than just assuming the folks that came before were fools.

Thanks to Greg Burek for this one.

Kellan Elliott-McCrea

The work we did to get our teams aligned and our systems in good shape meant that we were able to scale, even with some services getting 40 times the normal traffic.

Kriton Dolias and Vinessa Wan — The New York Times

How does one reconcile the emerging consensus for alerting exclusively on user-visible outages with the undeniable need to learn about and react to things *before* users notice? Like a high cache eviction rate?

There’s a real gem in here, definitely worth a read.

Charity Majors (and Liz Fong-Jones in reply)

Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing.

Rachel Perkins — Honeycomb

Delayed replication can be used as a first resort to recover from accidental data loss and lends itself perfectly to situations where the loss-inducing event is noticed within the configured delay.

Andreas Brandl — GitLab
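
If delayed replication is new to you: on a PostgreSQL standby (GitLab’s database is PostgreSQL), the mechanism amounts to one setting plus one function call, roughly as sketched below. The eight-hour delay is an illustrative value, not GitLab’s actual configuration:

    # On the delayed standby (recovery.conf on older releases,
    # postgresql.conf from PostgreSQL 12 onward):
    recovery_min_apply_delay = '8h'   # standby applies WAL eight hours behind the primary

    -- When an accidental DELETE is noticed within that window, pause WAL
    -- replay on the standby before the bad change is applied, then copy
    -- the lost rows back out at leisure:
    SELECT pg_wal_replay_pause();     -- pg_xlog_replay_pause() before PostgreSQL 10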

Outages

A production of Tinker Tinker Tinker, LLC