View on sreweekly.com
I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.
The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.
Paul Osman — Under Armour (Blameless Summit)
A routine infrastructure maintenance operation had unintended consequences, saturating MySQL with excessive connections.
Daniel Messer — RedHat
This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.
Jason Hayes — Mackinac Center for Public Policy
This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).
This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.
Sandhya Ramu and Vasanth Rajamani — LinkedIn
This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.
Keith Ballinger — GitHub
In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.
Martin Holman — Honeycomb
SRE From Home is back! It’s happening this Thursday, and I’ll be on the Ask an SRE panel answering your questions. And don’t miss the talks by lots of great folks, some of whom have had articles featured here previously!
They don’t. They just don’t.
[…] as deployments grow beyond a certain size it’s almost impossible to execute them successfully.
Alex Yates — Octopus Deploy
Whoops, forgot to include this one last week.
On June 30, Google’s email delivery service was targeted in what we believe was an attempt to bypass spam classification. The result was delayed message processing and increased message queuing.
My favorite part is the focus on blame awareness:
But it’s not enough to just be blameless—it’s also important to be blame-aware. Being blame-aware means that we are aware of our biases and how they may impact our ability to view an incident impartially.
Isabella Pontecorvo — PagerDuty
Netflix has a team dedicated to the overall reliability of their service.
Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.
Hank Jacobs — Netflix
Another good reference if you’re looking to bootstrap SRE at your organization.
Rich Burroughs — FireHydrant
Bill Duncan’s back with an easy and very close approximation for the “Tail at Scale” formula. The question it answers is: how many nines do you need on all of your backend microservices for X nines on the frontend?
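To get a feel for the arithmetic involved, here's a minimal Python sketch of that kind of calculation (my own illustration, not Duncan's exact formula): if a frontend fans out to N backends and needs them all to respond, each backend needs roughly log10(N) extra nines beyond the frontend's target.

```python
import math

def composite_nines(backend_nines: float, n_services: int) -> float:
    """Exact composite availability (in nines) of a frontend that
    requires all n_services backends, each with backend_nines of
    availability, to be up."""
    p = 1 - 10 ** (-backend_nines)           # per-backend availability
    return -math.log10(1 - p ** n_services)  # composite availability as nines

def approx_backend_nines(frontend_nines: float, n_services: int) -> float:
    """Approximation: each backend needs the frontend target
    plus about log10(N) extra nines."""
    return frontend_nines + math.log10(n_services)
```

For example, a frontend calling 10 backends and targeting four nines needs roughly five nines from each backend, and `composite_nines(5, 10)` confirms the composite lands very close to 4.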
Tons of great links in here with enticing descriptions to make you want to read them. Includes books, tools, hiring, certification, and general SRE goodness.
Emily Arnot — Blameless
SRE is all about keeping the user experience working, and working with product-focused folks can really help. For more on this, check out my former coworker Jen Wohlner’s awesome SRECon19 talk on SRE & product management.
Samantha Coffman — HelloFresh
- Cloudflare saw a 50% drop in traffic served by their network following a BGP issue. Linked is their analysis, including snippets of router configurations. Many services suffered contemporaneous outages possibly stemming from Cloudflare’s, including Discord, Postmates, Hosted Graphite, and DownDetector.
John Graham-Cumming — Cloudflare
- Twitter had a major security breach, and as part of their response, they temporarily cut off large parts of their service. Click for their post about what happened.
- Microsoft Outlook had an outage. Notably, it involved the Outlook application that people run on their computer, not the cloud version.