General

SRE Weekly Issue #231

lex

August 10, 2020

I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.

Articles

Improving Postmortems from Chores to Masterclass with Paul Osman

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

About the Quay.io Outage: Post Mortem

A routine infrastructure maintenance had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — RedHat

The 2020 Midland County Dam Failure

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

Heroku Incident #2090 Follow-up

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

Theory vs. Practice: Learnings from a recent Hadoop incident

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

GitHub Availability Report: July 2020

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

Incident Report: Investigating an Incident That’s Already Resolved

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

Outages

Discord
- This one is notable because it involves a purported “noisy neighbor” situation in Google Cloud Platform.
Slack
Canon
Steam
Some sites loading slowly
Indeed
Fastly

SRE Weekly Issue #230

lex

August 2, 2020

Happy BTW: Wear a mask.

Articles

LaunchDarkly’s Evolution from Polling to Streaming

LaunchDarkly started off with a polling-based architecture and ultimately migrated to pushing deltas out to clients.

Dawn Parzych — LaunchDarkly

A simpler alternative to distributed tracing for troubleshooting

A brief overview of some problems with distributed tracing, along with a suggestion of another way involving AI.

Larry Lancaster — Zebrium

Google Cloud Issue Summary Classroom – 2020-07-07

This is Google’s post-incident report for their Google Classroom incident on July 7.

Introducing Domain-Oriented Microservice Architecture

Uber has long been a champion of microservices. Now, with several years of experience, they share the lessons they’ve learned and how they deal with some of the pitfalls.

Adam Gluck — Uber

Keeping PagerDuty Always On With Remote Incident Response

This article opens with an interesting description of what the Cloudflare outage looked like from PagerDuty’s perspective.

Dave Bresci — PagerDuty

Safe by design?

This post reflects on two distinct philosophies of safety:

- the engineering design should ensure that the system is safe
- design alone cannot ensure that the system is safe

Lorin Hochstein

All we can do is find problems

You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem.

Lorin Hochstein

Outages

Facebook, Instagram and WhatsApp
Fastly
- Also two PoP-specific incidents:
  - BOG
  - JNB
  Full disclosure: Fastly is my employer.
Heroku

SRE Weekly Issue #229

lex

July 26, 2020

Articles

“How could they be so stupid?”

More details have emerged about the Twitter break-in last week, leading some to utter the quote above. Here’s a take on how to see it as not being about “stupidity”.

Lorin Hochstein

Data Consistency Checks

The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have a framework and a set of tools to deal with it.

Paul Hammond and Samantha Stoller — Slack

Obstacles to Learning from Incidents

I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.

Thai Wood — Learning from Incidents

You don’t need SRE. What you need is SRE.

[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.

Sanjeev Sharma

Questionable Advice: “What’s the critical path?”

Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).

Charity Majors

Thinking About Your Humans With J. Paul Reed

This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at https://srefromhome.com/. I love how they refer to the pandemic as a months-long incident, and point out that if you’re always in an incident then you’re never in an incident.

Julie Gunderson and Mandi Walls — Page it to the Limit

Rebuilding messaging: How we bootstrapped our platform

I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.

Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn

Outages

Garmin
Snapchat
Tweetdeck
GGPoker
- GGPoker had issues during a World Series of Poker (WSOP) event.
Fastly (control plane)
- Full disclosure: Fastly is my employer.
Squarespace
- Squarespace had a rough week, with the following incidents:
  - July 21
  - July 22 (includes a detailed follow-up analysis)
  - July 24
  - July 24
Google Cloud Platform
- Several GCP components were impacted, including Layer 7 Load Balancers.

SRE Weekly Issue #228

lex

July 19, 2020

SRE From Home is back! It’s happening this Thursday, and I’ll be on the Ask an SRE panel answering your questions. And don’t miss the talks by lots of great folks, some of whom have had articles featured here previously!

Articles

Change Advisory Boards Don’t Work

They don’t. They just don’t.

[…] as deployments grow beyond a certain size it’s almost impossible to execute them successfully.

Alex Yates — Octopus Deploy

Google Cloud Issue Summary: Gmail 2020-06-30

Whoops, forgot to include this one last week.

On June 30, Google’s email delivery service was targeted in what we believe was an attempt to bypass spam classification. The result was delayed message processing and increased message queuing.

Postmortems and More With J. Paul Reed

My favorite part is the focus on blame awareness:

But it’s not enough to just be blameless—it’s also important to be blame-aware. Being blame-aware means that we are aware of our biases and how they may impact our ability to view an incident impartially.

Isabella Pontecorvo — PagerDuty

Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix

Netflix has a team dedicated to the overall reliability of their service.

Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.

Hank Jacobs — Netflix

What is SRE?

Another good reference if you’re looking to bootstrap SRE at your organization.

Rich Burroughs — FireHydrant

The Tail at Scale Approximation

Bill Duncan’s back with an easy and very close approximation for the “Tail at Scale” formula. The question it answers is: how many nines do you need on all of your backend microservices for X nines on the frontend?

Bill Duncan
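
As a rough illustration of the question above, here is a minimal Python sketch. It assumes the frontend needs every one of N backends for each request and that backend failures are independent, so frontend availability is the product of the backend availabilities. The function names are hypothetical, and the shortcut shown (add log10(N) nines) is a standard first-order estimate, not necessarily the approximation from the linked article.

```python
import math


def backend_nines_exact(frontend_nines: float, n_backends: int) -> float:
    """Nines each backend needs so that all n_backends together
    deliver frontend_nines (assumes independent failures and that
    every backend is required for every request)."""
    frontend_target = 1.0 - 10.0 ** (-frontend_nines)
    # Frontend availability = per_backend ** n_backends, so invert it.
    per_backend = frontend_target ** (1.0 / n_backends)
    return -math.log10(1.0 - per_backend)


def backend_nines_approx(frontend_nines: float, n_backends: int) -> float:
    """First-order shortcut: the frontend's unavailability budget is
    split evenly across backends, which adds log10(N) nines."""
    return frontend_nines + math.log10(n_backends)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        exact = backend_nines_exact(3, n)
        approx = backend_nines_approx(3, n)
        print(f"N={n}: exact={exact:.4f} nines, approx={approx:.4f} nines")
```

Under these assumptions, a three-nines frontend backed by 100 services needs roughly five nines from each of them.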

The Essential List of Top SRE Resources

Tons of great links in here with enticing descriptions to make you want to read them. Includes books, tools, hiring, certification, and general SRE goodness.

Emily Arnot — Blameless

Advocating for a Product Mindset within Platform Teams and How We Do it at HelloTech (Part 1)

SRE is all about keeping the user experience working, and working with product-focused folks can really help. For more on this, check out my former coworker Jen Wohlner’s awesome SRECon19 talk on SRE & product management.

Samantha Coffman — HelloFresh

Outages

Cloudflare
- Cloudflare had a 50% drop in traffic served by their network subsequent to a BGP issue. Linked is their analysis including snippets of router configurations. Lots of services suffered contemporaneous outages possibly stemming from Cloudflare’s, including Discord, Postmates, Hosted Graphite, and DownDetector.
  John Graham-Cumming — Cloudflare
Twitter
- Twitter had a major security breach, and as part of their response, they temporarily cut off large parts of their service. Click for their post about what happened.
GitHub
WhatsApp
Hulu
Snapchat
Microsoft Outlook
- Notably, the outage involved the Outlook application that people run on their computer, not the cloud version.
Fastly
- Also a control plane incident later that day.
  Full disclosure: Fastly is my employer.

SRE Weekly Issue #227

lex

July 12, 2020

Articles

A Terrible, Horrible, No-Good, Very Bad Day at Slack

This is the first of a pair of articles this week on a major Slack outage in May. This one explores the technical side, with a lot of juicy details on what happened and how.

Laura Nolan — Slack

All Hands on Deck

This is the companion article that describes Slack’s incident response process, using the same incident as a case study.

Ryan Katkov — Slack

Improving Incident Retrospectives at Indeed

The author saw room for improvement in the retrospective process at Indeed. The article explains the recommendations they made and why, including de-emphasizing the generation of remediation items in favor of learning.

Alex Elman

Google Cloud Networking Incident #20005 Follow-Up

The datacenter was purposefully switched to generator power during planned power maintenance, but unfortunately the fuel delivery system failed.

Towards More Effective Incident Postmortems

This is a good primer on the ins and outs of running a post-incident analysis.

Anusuya Kannabiran — Squadcast

Setting SLOs: observability using custom metrics

This article goes through an interesting technique for setting up SLO metrics and alerts in GCP using Terraform and OpenCensus.

Cindy Quach — Google

Introducing the GitHub Availability Report

GitHub is committing to publishing a report on their availability each month with detail on incidents. This intro includes the reports for May and June with a description of 4 incidents.

Keith Ballinger — GitHub

Blameless’ SRE Journey

This is neat: Blameless transitioned from “startup mode” toward an SRE methodology, becoming customer 0 of their own product in the process.

Blameless

Outages

Facebook SDK
- As in May, a Facebook SDK release caused problems on iOS for Spotify, Pinterest, and Tinder.
Uber Eats
Crunchyroll
TikTok
Spotify
