
SRE Weekly Issue #232

A message from our sponsor, StackHawk:

Is your company adopting GraphQL? Adding security testing is simple. Watch this 20-minute walk-through to see how easy it is to get up and running!

https://www.youtube.com/watch?v=--liu7LCs5A

Articles

An engineer’s observation of a really effective Incident Command pattern.

Dean Wilson

Here’s Lorin Hochstein’s take on the STAMP (Systems-Theoretic Accident Model and Processes) workshop he attended recently.

Lorin Hochstein

What’s the difference between Resilience Engineering and High Reliability Organizations? This paper (and excellent summary) explains.

Torgeir Haavik, Stian Antonsen, Ragnar Rosness, and Andrew Hale (original paper)

Thai Wood — Resilience Roundup (summary)

This one focuses on what I feel are really important parts of SRE, taken from the article’s subheadings:

  • Vendor engineering
  • Product engineering
  • Sociotechnical systems engineering
  • Managing the portfolio of technical investments

Charity Majors — Honeycomb

Now that’s a for-serious incident report. Nice one, folks! This is an interesting case of theory-meets-reality for disaster planning.

giles — PythonAnywhere

Outages

SRE Weekly Issue #231

I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.

A message from our sponsor, StackHawk:

Learn about StackHawk’s setup of Prometheus Metrics with SpringBoot & GRPC Services.
https://www.stackhawk.com/blog/prometheus-metrics-with-springboot-and-grpc-services?utm_source=SREWeekly

Articles

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

A routine infrastructure maintenance operation had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — Red Hat

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

Outages

SRE Weekly Issue #230

BTW: Wear a mask.

A message from our sponsor, StackHawk:

Add security testing to your CI pipelines with GitHub Actions. Check out this webinar recording (no email required) to learn how.
https://www.youtube.com/watch?v=W_7BxFgMYHs&time_continue=8

Articles

LaunchDarkly started off with a polling-based architecture and ultimately migrated to pushing deltas out to clients.

Dawn Parzych — LaunchDarkly
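
As a rough illustration of the difference (a hypothetical Python sketch, not LaunchDarkly’s actual implementation): in a polling model every client re-fetches the full flag state on an interval, while in a push model the server tracks subscribers and sends only the changed keys when a flag updates.

    # Hypothetical sketch of polling vs. pushing deltas; names and
    # structure are illustrative, not LaunchDarkly's implementation.
    class FlagServer:
        def __init__(self):
            self.flags = {}
            self.subscribers = []  # callbacks registered by push clients

        def poll(self):
            # Polling model: every client fetches the full flag state
            # each interval, even when nothing has changed.
            return dict(self.flags)

        def subscribe(self, callback):
            # Push model: a client registers once instead of re-polling.
            self.subscribers.append(callback)

        def set_flag(self, key, value):
            self.flags[key] = value
            # Push model: notify subscribers with only the delta.
            for notify in self.subscribers:
                notify({key: value})

    server = FlagServer()
    server.subscribe(lambda delta: print("delta:", delta))
    server.set_flag("new-dashboard", True)  # prints: delta: {'new-dashboard': True}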

A brief overview of some problems with distributed tracing, along with a suggestion of another way involving AI.

Larry Lancaster — Zebrium

This is Google’s post-incident report for their Google Classroom incident on July 7.

Uber has long been a champion of microservices. Now, with several years of experience, they share the lessons they’ve learned and how they deal with some of the pitfalls.

Adam Gluck — Uber

This article opens with an interesting description of what the Cloudflare outage looked like from PagerDuty’s perspective.

Dave Bresci — PagerDuty

This post reflects on two distinct philosophies of safety:

  • the engineering design should ensure that the system is safe
  • design alone cannot ensure that the system is safe

Lorin Hochstein

You can’t use availability metrics to determine whether your system is reliable enough; they can only tell you when you have a problem.

Lorin Hochstein

Outages

SRE Weekly Issue #229

A message from our sponsor, StackHawk:

Read about how to build test driven security with StackHawk + Travis CI + Docker Compose.
https://www.stackhawk.com/blog/test-driven-security-with-travis-ci-and-docker-compose?utm_source=SREWeekly

Articles

More details have emerged about the Twitter break-in last week. Here’s a take on how to see it as not being about “stupidity”.

Lorin Hochstein

The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have a framework and a set of tools to deal with it.

Paul Hammond and Samantha Stoller — Slack

I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.

Thai Wood — Learning from Incidents

[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.

Sanjeev Sharma

Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).

Charity Majors

This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at https://srefromhome.com/. I love how they refer to the pandemic as a months-long incident, and point out that if you’re always in an incident then you’re never in an incident.

Julie Gunderson and Mandi Walls — Page it to the Limit

I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.

Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn
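
For anyone new to the pattern, here’s a minimal sketch of how a dual-write migration typically works (my own illustration in Python, not LinkedIn’s code): writes go to both the old and new stores while the old store remains the source of truth, and reads cut over only after the new store is backfilled and validated.

    # Illustrative sketch of the dual-write migration pattern;
    # not LinkedIn's actual implementation.
    class DualWriteStore:
        def __init__(self, old_store, new_store, read_from_new=False):
            self.old = old_store    # current source of truth
            self.new = new_store    # migration target
            self.read_from_new = read_from_new

        def write(self, key, message):
            # The old store stays authoritative, so write to it first.
            self.old[key] = message
            try:
                self.new[key] = message
            except Exception:
                # A failed write to the new store shouldn't fail the
                # request; a real system would record it here so a
                # backfill job can reconcile the stores later.
                pass

        def read(self, key):
            # Flip read_from_new only once the new store has been
            # backfilled and verified against the old one.
            return (self.new if self.read_from_new else self.old)[key]

    store = DualWriteStore(old_store={}, new_store={})
    store.write("msg-1", "hello")
    assert store.read("msg-1") == "hello"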

Outages

SRE Weekly Issue #228

SRE From Home is back! It’s happening this Thursday, and I’ll be on the Ask an SRE panel answering your questions. And don’t miss the talks by lots of great folks, some of whom have had articles featured here previously!

A message from our sponsor, StackHawk:

StackHawk is built on the open source ZAP application security scanner, the most widely used AppSec tool out there. Now the founder of ZAP has joined our team to bring AppSec to developers. Read all about it.
https://www.stackhawk.com/blog/zap-founder-decides-to-join-stackhawk?utm_source=SREWeekly

Articles

They don’t. They just don’t.

[…] as deployments grow beyond a certain size it’s almost impossible to execute them successfully.

Alex Yates — Octopus Deploy

Whoops, forgot to include this one last week.

On June 30, Google’s email delivery service was targeted in what we believe was an attempt to bypass spam classification. The result was delayed message processing and increased message queuing.

My favorite part is the focus on blame awareness:

But it’s not enough to just be blameless—it’s also important to be blame-aware. Being blame-aware means that we are aware of our biases and how they may impact our ability to view an incident impartially.

Isabella Pontecorvo — PagerDuty

Netflix has a team dedicated to the overall reliability of their service.

Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.

Hank Jacobs — Netflix

Another good reference if you’re looking to bootstrap SRE at your organization.

Rich Burroughs — FireHydrant

Bill Duncan’s back with an easy and very close approximation for the “Tail at Scale” formula. The question it answers is: how many nines do you need on all of your backend microservices for X nines on the frontend?

Bill Duncan
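
To make the arithmetic concrete, here’s one common back-of-the-envelope version of that calculation (my own sketch, not necessarily Duncan’s exact formula): if the frontend depends on n backends in series, each with availability a, the frontend sees roughly a**n, which works out to each backend needing about log10(n) more nines than the frontend target.

    import math

    def backend_nines(frontend_nines, n_services):
        """Approximate nines each of n serially required backends
        must deliver for the frontend to hit its target nines.

        For availability a close to 1, 1 - a**n ~= n * (1 - a), so the
        frontend's unavailability budget is split n ways:
        backend_nines ~= frontend_nines + log10(n).
        """
        return frontend_nines + math.log10(n_services)

    # 3 nines (99.9%) on the frontend across 10 serial backends needs
    # roughly 4 nines (99.99%) from each backend.
    print(backend_nines(3, 10))  # 4.0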

Tons of great links in here with enticing descriptions to make you want to read them. Includes books, tools, hiring, certification, and general SRE goodness.

Emily Arnot — Blameless

SRE is all about keeping the user experience working, and working with product-focused folks can really help. For more on this, check out my former coworker Jen Wohlner’s awesome SRECon19 talk on SRE & product management.

Samantha Coffman — HelloFresh

Outages

  • Cloudflare
    • Cloudflare had a 50% drop in traffic served by their network following a BGP issue. Linked is their analysis, including snippets of router configurations. Lots of services suffered contemporaneous outages possibly stemming from Cloudflare’s, including Discord, Postmates, Hosted Graphite, and DownDetector. John Graham-Cumming — Cloudflare
  • Twitter
    • Twitter had a major security breach, and as part of their response, they temporarily cut off large parts of their service. Click for their post about what happened.
  • GitHub
  • WhatsApp
  • Hulu
  • Snapchat
  • Microsoft Outlook
    • Notably, the outage involved the Outlook application that people run on their computer, not the cloud version.
  • Fastly
A production of Tinker Tinker Tinker, LLC