
SRE Weekly Issue #173

I’m back! Thank you all so much for the outpouring of support while SRE Weekly was on hiatus.  My recovery is going nicely and I’m starting to catch up on my long backlog of articles to review.  I’m going to skip trying to list all the outages that occurred since the last issue and instead just focus on a couple of interesting follow-up posts.

A message from our sponsor, VictorOps:

Alert fatigue will kill team morale. Take a look at some great ways to avoid alert fatigue and why it’s important for employee health and incident resolution speed:

http://try.victorops.com/SREWeekly/Avoiding-Alert-Fatigue

Articles

So many awesome concepts packed into this article. Here are just a couple:

Seen in this light, “severity” could be seen as a currency that product owners and/or hiring managers could use to ‘pay’ for attention.

This yields the logic that if a customer was affected, learning about the incident is worth the effort, and if no customers experienced negative consequences for the incident, then there must not be much to learn from it.

John Allspaw — Adaptive Capacity Labs

This shares more in common with the server behind sreweekly.com than I perhaps ought to admit to:

Additionally, lots can be done for scalability regarding infrastructure: I’ve kept everything on a single, smaller server basically as a matter of stubbornness and wanting to see how far I can push a single VPS.

Simon Fredsted

A Reddit engineer explains a hidden gotcha of pg_upgrade that caused an outage I reported here previously.

Jason Harvey — Reddit

This has “normalization of deviance” all over it.

Taylor Dolven — The Miami Herald

The deep details around MCAS are starting to come out. This article tells a tale that is all too familiar to me about organizational pressures and compartmentalization.

Jack Nicas, David Gelles and James Glanz — New York Times

Outages

  • Google
    • Click through for Google’s blog post about the outage that impacted Google Cloud Platform, YouTube, Gmail, and Google Drive. A configuration change intended for a small number of servers was incorrectly applied more broadly, causing reduced network capacity. The similarity to the second Heroku outage below is striking.
  • Heroku Incident #1776 Follow-up
    • An expired SSL certificate caused control plane impact and some impact to running applications. (See the expiry-check sketch just after this list.)
  • Heroku Incident #1789 Follow-up
    • A configuration change intended for a testing environment was mistakenly applied to production, resulting in 100% of requests in the EU failing.
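
Expired certificates are the kind of failure a scheduled check can catch weeks in advance. Here’s a minimal sketch in Python using only the standard library; the hostname and the 30-day threshold are illustrative placeholders, not anything from Heroku’s follow-up.

    # Minimal sketch: warn when a host's TLS certificate is close to expiry.
    import socket
    import ssl
    import time

    def days_until_expiry(hostname: str, port: int = 443) -> float:
        """Return the number of days until the host's TLS certificate expires."""
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
                cert = tls.getpeercert()
        expires = ssl.cert_time_to_seconds(cert["notAfter"])
        return (expires - time.time()) / 86400

    if __name__ == "__main__":
        remaining = days_until_expiry("example.com")  # placeholder hostname
        if remaining < 30:  # alert well before the certificate lapses
            print(f"WARNING: certificate expires in {remaining:.0f} days")

Run something like this on a schedule against every externally visible endpoint, and an expired certificate becomes a ticket instead of an outage.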

SRE Weekly Issue #172

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.

http://try.victorops.com/sreweekly/continuous-improvement-webinar

Articles

An experienced pilot and programmer details the background behind the 737 MAX’s MCAS system and discusses the risks and motivations involved.

Boeing’s solution to its hardware problem was software.

Thanks to John Goerzen for this one.

Gregory Travis — IEEE Spectrum

A detailed analysis of a paper by Eric Hollnagel and David Woods on designing systems that include humans and computers.

The operator detects failures better when he participates in system control as opposed to functioning only as a monitor…

Thai Wood (summary)

An essay on the difference in philosophies between Safety I and Safety II and on understanding how our systems succeed rather than focusing on how they fail.

Ryan Frantz

Azure’s Project Tardigrade is exploring interesting ideas like keeping VMs resident in memory even when the host kernel reboots. This reminds me of another similarly-named project.

Chris Kanaracus — TechTarget

This is a follow-up to an article from last week about a Honeycomb incident, going into more detail on what went wrong and how they figured it out using Honeycomb itself.

Douglas Soo — Honeycomb

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement.

[…]

This blog post describes Grab’s post-mortem findings for the outage caused by the Redis Cluster failure.

Michael Cartmell, Jiahao Huang, and Sandeep Kumar — Grab

I like how their chosen solution fetches from all the datacenters in the normal case, so they don’t experience a sudden shift in traffic pattern during a failover.

Preslav Le — Dropbox
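
The fan-out idea is simple enough to sketch. This is a minimal illustration rather than Dropbox’s implementation; fetch_from and the region names are hypothetical placeholders. Every read goes to all datacenters and the first success wins, so losing a datacenter mostly shows up as latency rather than as a new traffic pattern.

    # Minimal sketch of a fan-out read: query every datacenter in parallel
    # and return the first success. fetch_from and REGIONS are hypothetical.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    REGIONS = ["us-east", "us-west", "eu-central"]  # illustrative names

    def fetch_from(region: str, key: str) -> bytes:
        """Placeholder for a real per-datacenter read."""
        raise NotImplementedError

    def fan_out_read(key: str) -> bytes:
        """Ask every region at once; the first region to answer wins."""
        errors = []
        with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
            futures = {pool.submit(fetch_from, region, key): region
                       for region in REGIONS}
            for fut in as_completed(futures):
                try:
                    return fut.result()
                except Exception as exc:  # a down region simply loses the race
                    errors.append((futures[fut], exc))
        raise RuntimeError(f"all regions failed: {errors}")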

Outages

SRE Weekly Issue #171

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.

http://try.victorops.com/sreweekly/continuous-improvement-webinar

Articles

TL;DR: Prefer investing in recovery over prevention.

Make failure a non-event rather than trying to prevent it. You won’t succeed in fully preventing failures, and you’ll only end up out of practice at recovering from them.

Aaron Blohowiak

They had me at “normalization of deviance”. I’ll read pretty much anything with that in the title.

Tim Davies — Fast Jet Performance

Monzo’s system is directly integrated with Slack, helping you manage your incident and track what happens. Check out their video presentation for more details.

Monzo

Me too! Great thread.

Nolan Caudill and others

I love Honeycomb incident reviews, I really do.

Douglas Soo

Born from a Twitter argument thread, this article goes into depth about why Friday change freezes can do far more harm than good.

Charity Majors

Outages

SRE Weekly Issue #170

A message from our sponsor, VictorOps:

Our latest list of the top 12 server monitoring tools can help your SRE team get started in building a comprehensive monitoring strategy. Drive deeper service reliability through effective server monitoring:

http://try.victorops.com/sreweekly/top-server-monitoring-software

Articles

This myth is a misguided belief that engineers are like Laplace’s Demon; they maintain an accurate mental model of the system, foresee all the consequences of their actions, predict where the business is going, and are careful enough to avoid mistakes.

Aaron Blohowiak — Netflix

I highly recommend watching some of the talks or at least perusing slides.

The concern is that incidents have been investigated by parties that were involved in or related to the incident, raising the possibility of conflicts of interest. In a small company, this may be unavoidable, but we should at least keep the risk in mind.

Patrick Kingsland — Railway Technology

An absolute treasure trove of links to many articles and papers on resilience engineering. Beyond just links, there are short profiles of 30+ important thinkers in the field. I’m going to be busy for a while.

@lorin (GitHub)

This is about project retrospectives, but it applies equally well to incident retrospectives.

Dominika Bula — Red Hat

Here’s a counterpoint to an article I linked to last week.

Karl Bode — Motherboard

Outages

A production of Tinker Tinker Tinker, LLC