SRE Weekly Issue #231

I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.

A message from our sponsor, StackHawk:

Learn about StackHawk’s setup of Prometheus metrics with Spring Boot & gRPC services.
https://www.stackhawk.com/blog/prometheus-metrics-with-springboot-and-grpc-services?utm_source=SREWeekly

Articles

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

Routine infrastructure maintenance had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — Red Hat

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

Outages
