SRE Weekly Issue #166

SRECon was amazing! The talk line-up was mind-blowing, and it was great to meet many of you there. A big thanks to all the speakers for making this one a conference to remember.

A message from our sponsor, VictorOps:

In case you missed it, check out the recording of the recent VictorOps webinar, How to Make On-Call Suck Less. The webinar covers 5 actionable steps every SRE team can take to improve alerting and make on-call suck less:

http://try.victorops.com/sreweekly/making-on-call-suck-less

Articles

One of my favorite moments of SRECon: during their talk, Dorothy Jung and Wenting Wang unveiled this choose-your-own-adventure-style game for practicing your incident response skills. See if you can resolve the incident before your stress level gets too high!

Chie Shu, Dorothy Jung, Joel Salas, Dennis So, Sam Faber-Manning, and Wenting Wang — Yelp

Last week was only the second SRECon I’ve managed to attend. Rather than post raw notes from all the talks I attended, I tried something different: I wrote down only the really big stuff that made me think or blew my mind. I’m hoping that reading this will give those of you who weren’t able to attend a taste of the conference.

Lex Neva

Inspired by SRECon, John Allspaw posted this Twitter thread on the “Humans Are Better At” / “Machines Are Better At” concept.

Who will argue with “make the computers do the easy/tedious stuff so humans can do the difficult/interesting stuff”? (apparently, I will)

John Allspaw

This article goes into what the pilots of the Lion Air 737 Max 8 (and presumably the Ethiopian Airlines one as well) would have had to do in order to regain control over the aircraft. We’re starting to get hints of the task saturation and alert overload both sets of pilots may have faced as they tried to handle the situation:

The Lion Air crew would have had to accomplish this while dealing with a host of alerts, including differences in other sensor data between the pilot and co-pilot positions that made it unclear what the aircraft’s altitude was.

Thanks to Courtney Eckhardt for this one.

Sean Gallagher — Ars Technica

The day before the Lion Air 737 Max 8 crash last fall, the same aircraft experienced a failure similar to the one that may have brought it down.

Thanks to Courtney Eckhardt for this one.

Alan Levin and Harry Suhartono — Bloomberg

Calvin is interesting for (at least) two reasons: first, it’s designed to work with an existing database, and second, it manages an impressively fast transaction throughput rate.

Adrian Colyer (summary) — The Morning Paper

Thomson et al. (original paper)

This article draws an interesting parallel between two talks at SRECon last week, both about making sure that your monitoring doesn’t itself cause incidents.

Beth Pariseau — TechTarget

Outages

Updated: March 31, 2019 — 9:04 pm
A production of Tinker Tinker Tinker, LLC