General

SRE Weekly Issue #235

A message from our sponsor, StackHawk:

Adding application security tests to your CI pipeline is simple. It typically takes <30 minutes to set up automated testing so you can be confident of your application’s security. Check out our onboarding guide to see how to get started.
https://www.stackhawk.com/blog/onboarding-guide?SREWeekly

Articles

This isn’t just another boring article about SLOs. There’s a ton of good stuff in here about why they moved to SLO-based alerts, too.

we’re hoping that by implementing SLOs – and alerting on them – we’ll be able to improve communication during incidents, reduce the toil on on-callers, and help improve our reliability in a way that’s meaningful to our users.

Mads Hartmann

Often, serendipity gets us out of an incident or makes it less severe.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.

Lorin Hochstein

It’s your classic “replace the engines on a jet while flying it” story. My favorite part is how they recorded real traffic and replayed it against both the old and new backend APIs to compare the JSON responses.

Rohan Dhruva and Ed Ballot — Netflix
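
For readers curious what comparing replayed responses might look like, here is a minimal Python sketch. It is not Netflix's tooling: `diff_json` and the sample payloads are hypothetical, and the sketch assumes the two responses for a recorded request have already been captured as JSON text.

```python
import json


def diff_json(old, new, path=""):
    """Return a list of readable differences between two decoded JSON values."""
    if type(old) is not type(new):
        return [f"{path or '/'}: type {type(old).__name__} != {type(new).__name__}"]
    if isinstance(old, dict):
        diffs = []
        # Walk the union of keys so additions and removals are both reported.
        for key in sorted(set(old) | set(new)):
            child = f"{path}/{key}"
            if key not in old:
                diffs.append(f"{child}: only in new")
            elif key not in new:
                diffs.append(f"{child}: only in old")
            else:
                diffs.extend(diff_json(old[key], new[key], child))
        return diffs
    if isinstance(old, list):
        if len(old) != len(new):
            return [f"{path or '/'}: list length {len(old)} != {len(new)}"]
        diffs = []
        for index, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_json(a, b, f"{path}/{index}"))
        return diffs
    # Scalars (str, int, float, bool, None) are compared directly.
    return [] if old == new else [f"{path or '/'}: {old!r} != {new!r}"]


if __name__ == "__main__":
    # Hypothetical responses from the old and new backends for one replayed request.
    old_body = json.loads('{"id": 1, "title": "A", "tags": ["x", "y"]}')
    new_body = json.loads('{"id": 1, "title": "B", "tags": ["x", "y"], "extra": null}')
    for line in diff_json(old_body, new_body):
        print(line)
```

An empty list means the two backends agreed on that request. Anything else is a lead worth chasing before cutting traffic over.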

Feature flags can help with load shedding and throttling, and feature flag activity can even be useful data that points to contributing factors.

Dawn Parzych — LaunchDarkly

Unimog uses a lot of really interesting techniques to balance layer 4 traffic, and this article goes into them in great detail.

David Wragg — Cloudflare

I like this idea: it’s like a normal canary, except that you only send it a copy of traffic and discard the result, so as to avoid impacting users.

David Hoa — LinkedIn
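
The idea can be sketched in a few lines of Python. This is an illustration only, not LinkedIn's implementation: `handle_with_dark_canary`, `primary`, and `canary` are hypothetical names, and a real deployment would most likely mirror traffic asynchronously at the proxy or load balancer layer rather than in process.

```python
import copy


def handle_with_dark_canary(request, primary, canary, on_canary_error=None):
    """Serve the request from primary; mirror a copy to canary and discard its result."""
    # Copy up front so neither handler can see the other's mutations.
    mirrored = copy.deepcopy(request)
    response = primary(request)
    try:
        canary(mirrored)  # return value intentionally ignored
    except Exception as exc:
        # Canary failures are recorded for analysis but never reach the user.
        if on_canary_error is not None:
            on_canary_error(exc)
    return response
```

Only the primary's response is ever returned, so a crashing or misbehaving canary shows up in the error callback (or in its own metrics) and nowhere else.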

Outages

SRE Weekly Issue #234

Last Sunday, there was a major backbone Internet provider outage after I finished putting SRE Weekly together. There were so many outages that I’m not even going to bother listing all of them in the Outages section.

A message from our sponsor, StackHawk:

Everyone talks about shifting security left, but in many cases, it isn’t happening. There is a better way with developer-centric application security testing.
https://www.stackhawk.com/blog/align-engineering-security-appsec-tests-in-ci?utm_source=SREWeekly

Articles

I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.

Emily Arnott — Blameless

Incident review is an important part of the organizational learning process, but it can be practiced in a way where the focus shifts away from learning to fixing.

John Carroll (original paper)

Thai Wood — Resilience Roundup (summary)

My latest adventures in (negligently) running sreweekly.com. It started with a surprise AWS bill, and then it got kinda weird…

Lex Neva

Deep technical details on a series of recent incidents involving Basecamp.

Troy Toman — Basecamp

Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.

Charity Majors

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Keith Ballinger — GitHub

Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.

Matthew Prince — Cloudflare

This is ThousandEyes’s analysis of the outage, which goes along similar lines to Cloudflare’s and includes a lot more detail.

Angelique Medina and Archana Kesavan — ThousandEyes

Outages

SRE Weekly Issue #233

A message from our sponsor, StackHawk:

Did you catch the GitLab Commit keynote by StackHawk Founder Joni Klippert? View on demand now to learn about how security got left behind, and what can be done to improve.
https://about.gitlab.com/events/commit/

Articles

In this post, I’ll share how we ensured that Meet’s available service capacity was ahead of its 30x COVID-19 usage growth, and how we made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.

Samantha Schaevitz — Google

I love the concept of “battleshorts” just as much as I’ve been enjoying this series of articles analyzing STAMP.

Lorin Hochstein

Honeycomb had five incidents in just over a week, prompting not only their normal incident investigation process, but a meta-analysis of all five together.

Emily Nakashima — Honeycomb

Why is Chromium responsible for half of the DNS queries to the root nameservers? And why do they all return NXDOMAIN?

Matthew Thomas — APNIC

“That Moment” when your fire suppression system triggers and the fire department shows up. This is part war story and part description of incident response practices.

Ariel Pisetzky — Taboola

An overload in an internal blob storage system impacted many dependent services.

Google

Sharding as a service: now there’s an interesting idea.

Gerald Guo, Thawan Kooburat — Facebook

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

Emily Arnott — Blameless

Outages

SRE Weekly Issue #232

A message from our sponsor, StackHawk:

Is your company adopting GraphQL? Adding security testing is simple. Watch this 20-minute walkthrough to see how easy it is to get up and running!

https://www.youtube.com/watch?v=--liu7LCs5A

Articles

An engineer’s observation of a really effective Incident Command pattern.

Dean Wilson

Here’s Lorin Hochstein’s take on the STAMP (Systems-Theoretic Accident Model and Processes) workshop he attended recently.

Lorin Hochstein

What’s the difference between Resilience Engineering and High Reliability Organizations? This paper (and excellent summary) explains.

Torgeir Haavik, Stian Antonsen, Ragnar Rosness, and Andrew Hale (original paper)

Thai Wood — Resilience Roundup (summary)

This one focuses on what I feel are really important parts of SRE, taken from the article’s subheadings:

  • Vendor engineering
  • Product engineering
  • Sociotechnical systems engineering
  • Managing the portfolio of technical investments

Charity Majors — Honeycomb

Now that’s a for-serious incident report. Nice one, folks! This is an interesting case of theory-meets-reality for disaster planning.

giles — PythonAnywhere

Outages

SRE Weekly Issue #231

I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.

A message from our sponsor, StackHawk:

Learn about StackHawk’s setup of Prometheus Metrics with Spring Boot & gRPC Services.
https://www.stackhawk.com/blog/prometheus-metrics-with-springboot-and-grpc-services?utm_source=SREWeekly

Articles

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

Routine infrastructure maintenance had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — Red Hat

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

Outages

A production of Tinker Tinker Tinker, LLC