General

SRE Weekly Issue #165

As I write this, I’m headed to New York City for SRECon19 Americas, and I can’t wait!  If you’re there, come hit me up for some SRE Weekly swag, made using open source software.

A message from our sponsor, VictorOps:

Reducing MTTA and MTTR takes 5 simple steps. Check out this recent blog series, Reducing MTTA, to find 5 simple steps for improving incident response, lowering MTTA over time and making on-call suck less for DevOps and SRE teams:

http://try.victorops.com/sreweekly/reducing-mtta-alerts

Articles

As we discover more about the Boeing 737 MAX accidents, this author trolled through the ASRS database looking for related complaints.

Thanks to Greg Burek for this one.

James Fallows — The Atlantic

Learn about ASRS, the Aviation Safety Reporting System. Pilots and other aviation crew can report concerns anonymously, and the results are summarized regularly and reported to the FAA, NTSB, and other organizations.

Thanks to Greg Burek for this one.

Jerry Colen — NASA

I caught wind of a previous Boeing 737 issue from the 90s during a personal conversation this week. There’s an interesting parallel to the current 737 MAX issue, as Boeing blamed pilots for incorrectly responding to a “normal” flight incident for which pilots are routinely trained.

Various — Wikipedia

Dr Justine Jordan gives a personal account of how on-duty napping during extended overnight in-hospital duty hours as a trainee doctor eased her fatigue levels and raised her state of alertness

Dr. Justine Jordan — Irish Medical Times

circuit breakers aren’t great because server depends on clients to be configured correctly. throttling server-side is better

Circuit-breakers are great, but the service depends on the clients to be configured correctly. A server-side rate-limiting solution is more robust.

Michael Cartmell — Grab

The concept of an ACL-based authorization system is simple enough, but can be a challenge to maintain at scale.

Michael Leong — LinkedIn

We can tell one thing from the outside: it wasn’t a BGP issue.

Alec Pinkham — AppNeta

Outages

SRE Weekly Issue #164

A message from our sponsor, VictorOps:

Start making on-call suck less. Last chance to register for the free VictorOps webinar where you can learn about using automation and improved collaboration to create a better on-call experience:

http://try.victorops.com/sreweekly/how-to-make-on-call-suck-less

Articles

I previously shared an article about the 737 MAX 8, and I’m truly saddened that another accident has occurred. Learning from accidents like this is incredibly important, and the NTSB is among the best at it. I look forward to see what we can take away from this to make air travel even safer.

Farnoush Amiri and Ben Kesslen — NBC

The existence of this anonymous channel for pilots is really interesting to me. It sounds like a great way to learn about near misses, which can be nearly identical to catastrophic accidents. Can we implement this kind of anonymous channel in our organizations too?

Thom Patterson and Aaron Cooper — CNN

“Aviation accidents are rarely the result of a single cause,” Lewis noted. “There are often many small things that lead to a crash, and that’s why these investigations take so long.”

Francesca Paris — NPR

Google and other companies are working on their own private undersea cables.

‘People think that data is in the cloud, but it’s not. It’s in the ocean.’

Adam Satariano — New York Times

For this week, I have an article about on-call and how its done at NASA. Many of the conclusions here may not be that surprising to those who have been on-call for any length of time, but I think there is a lot to learn from how NASA makes the system work.

Thai Wood — Resilience Roundup (summary)

Emily S Patterson and David D Woods — Ohio State University (original article)

I hadn’t thought of this before, but I really like this idea:

The facilitator’s role in the meeting is different from the other participants. They do not voice their own ideas, but keep the discussion on track and encourage the group to speak up.

Rachael Byrne — PagerDuty

Outages

SRE Weekly Issue #163

A message from our sponsor, VictorOps:

Being on-call sucks. To make it better, sign up for the free webinar, “How to Make On-Call Suck Less”, to learn 5 simple steps you can take to improve the on-call experience and become a more efficient SRE team:

http://try.victorops.com/sreweekly/how-to-make-on-call-suck-less

Articles

Using an NTSB report on an airplane crash as a case study, this video presents three common traps we fall into in incident retrospectives:

  1. Counterfactual reasoning
  2. Normative language
  3. Mechanistic reasoning

I want to make this required material for all retrospective participants.

Dr. Johan Bergström — Lund University

Peak-shifting can save you and your customers money and make load easier to handle.

Lara PuReum Yim, Prashant Kumar, Raghav Garg, Preeti Kotamarthi, Ajmal Afif, Calvin Ng Tjioe, and Renrong Weng — Grab

These folks structured their on-call and incident response process around wombats (and sound guidelines and playbooks).

Wes Mason — npm

Lots of great stuff in this case study on an incident involving Chef and Apache. My favorite:

Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals

Ryn Daniels — HashiCorp

Here’s how and why Grab rebuilt their logging pipeline around structured JSON events.

Aditya Praharaj — Grab

Don Miguel Ruiz’s Four Agreements as applied to incident response:

  1. Be Impeccable With Your Word
  2. Don’t Take Anything Personally
  3. Don’t Make Assumptions
  4. Always Do Your Best

Matt Stratton — PagerDuty

Outages

SRE Weekly Issue #162

A message from our sponsor, VictorOps:

Ever been on-call? Then you know it can suck. Check out some of our tips and tricks to see how SRE teams are maintaining composure during a critical incident and making on-call suck less:

http://try.victorops.com/sreweekly/on-call-stress-management

Articles

Want to nerd out on BGP? Check out how this person modeled the Eve Online universe as an 8000-VM cluster running BGP.

Ben Cartwright-Cox

Accrued vacation time is antiquated, and “unlimited” vacation paradoxically leads employees to take less time overall. Time to enforce vacations, lest we forget that burnout is a reliability risk.

Baron Schwartz

How to avoid catastrophe: pay attention to near misses. This article makes an incredibly compelling point that we need to make a conscious effort to pay attention to near misses, and explains how cognitive bias will tend to make us do the exact opposite.

Catherine H. Tinsley, Robin L. Dillon, and Peter M. Madsen — Harvard Business Review

An intro to how blame causes problems, why blamelessness is better, and how to adopt a blameless culture.

Ashar Rizqi

A 100-year-old chemical company thought they had a great safety record. Turns out that folks were just considering accidents “routine” and not reporting them.

Thai Wood (reviewing a paper by Stefanie Huber, Ivette van Wijgerden, Arjan de Witt, and Sidney W.A. Dekker)

Booking.com has 50,000 servers and many SRE squads. They developed tools they call the Reliability Collaboration Model and the Ownership Map to help them define which products SRE squads support and at what level.

Emmanuel Goossaert — Booking.com

Outages

  • New Relic
  • Duo Security
  • Amtrak (US long-distance passenger rail)
    • Amtrak had an outage of its switching system this past week. Linked above is an article with the inflammatory title, “Human error? Try abject stupidity, Amtrak”. Exercise: try to think of ways in which this is not a case of abject stupidity.

      Rich Miller — Capitol Fax

  • YouTube

SRE Weekly Issue #161

A message from our sponsor, VictorOps:

Being on-call can suck. Without alert context or a collaborative system for incident response, SRE teams will have trouble appropriately responding to on-call incidents. Check out The On-Call Template to become the master of on-call and improve service reliability:

http://try.victorops.com/sreweekly/the-on-call-template

Articles

I’m not a fan of error budgets. I’ve never seen them implemented particularly well up close, though I know lots of folks who say it works for them.

I’ve started to feel a bit sour on the whole error budget thing, but I couldn’t really pin down why. This article really nails it.

Will Gallego

Will Gallego is my co-worker, although I came across this article separately.

I’m still hooked on flight accident case studies. In this one, mission fixation and indecision lead to disaster.

Air Safety Institute

If I was setting up curriculum at a university I’d make an entire semester-long class on The Challenger disaster, and make it required for any remotely STEM-oriented major.

This awesome article is about getting so used to pushing the limits that you forget you’re even doing it, until disaster strikes.

Foone Turing

A couple weeks back, I linked to a survey about compensation for on-call. Here’s an analysis of the results and some raw data in case you want to tinker with it.

Chris Evans and Spike Lindsey

Learn how this company does incident management drills. They seem to handle things much like a real incident, including doing a retrospective afterward!

Tim Little — Kudos

Outages

SRE WEEKLY © 2015 Frontier Theme