SRE Weekly Issue #193

Articles

Ever had a Sev 1 non-impacting incident? This team’s Consul cluster was balanced on a razor’s edge: one false move and quorum would be lost. Read about their incident response and learn how they avoided customer impact.

Devin Sylva — GitLab

Hot SRE trends in 2019 (brought to you from SREcon EMEA)

This SRECon EMEA highlight reel is giving me serious FOMO.

Will Sewell — Pusher

Resilience Roundup – Handoff strategies in settings with high consequences for failure: lessons for health care operations

This week we’re taking a look at how teams in high consequence domains perform handoffs between shifts.

Emily Patterson, Emilie Roth, David Woods, and Renee Chow (original paper)

Thai Wood (summary)

Scaling in the presence of errors—don’t ignore them

This is an interesting essay on handling errors in complex systems.

In other words, the trick to scaling in the presence of errors is building software around the notion of recovery. Automated recovery.

tef

Fast dimensional analysis for root cause analysis at scale

To be clear: this is about assisting incident responders in gaining an understanding of an incident in the moment, not about finding a “root cause” to present in an after-action report.

I’m not going to pretend to understand the math, but the concept is intriguing.

Nikolay Pavlovich Laptev, Fred Lin, Keyur Muzumdar, Mihai-Valentin Curelea, Seunghak Lee, and Sriram Sankar — Facebook

The inflection point hypothesis: a principled approach to finding the root cause of a failure

This one’s about assisting humans in debugging, when they have a reproduction case for a bug but can’t see what’s actually going wrong.

That’s two different uses of “root cause” this week, and neither one is the troublesome variety that John Allspaw has debunked repeatedly.

Zhang et al. (original paper)

Adrian Colyer (summary)

Outages

Honeycomb
- Here‘s an unroll of an interesting Twitter thread by Honeycomb’s Liz Fong-Jones during and after the incident.
GitHub
Amazon Prime Video
Google Compute Engine
- Network administration functions were impacted. Click for their post-incident analysis.
Squarespace
- On Wednesday November 6th, many Squarespace websites were unavailable for 102 minutes between 14:13 and 15:55 ET.
  
  Click through for their post-incident analysis.

SRE Weekly Issue #193

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues