SRE Weekly Issue #231

I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.

A message from our sponsor, StackHawk:

Learn about StackHawk’s setup of Prometheus metrics with Spring Boot & gRPC services.
https://www.stackhawk.com/blog/prometheus-metrics-with-springboot-and-grpc-services?utm_source=SREWeekly

Articles

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

Routine infrastructure maintenance had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — Red Hat

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

Outages
