SRE Weekly Issue #234

Last Sunday, just after I finished putting SRE Weekly together, a major Internet backbone provider had an outage.  It caused so many downstream outages that I’m not even going to bother listing them all in the Outages section.

A message from our sponsor, StackHawk:

Everyone talks about shifting security left, but in many cases, it isn’t happening. There is a better way with developer-centric application security testing.
https://www.stackhawk.com/blog/align-engineering-security-appsec-tests-in-ci?utm_source=SREWeekly

Articles

I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.

Emily Arnott — Blameless

Incident review is an important part of the organizational learning process, but it can be practiced in a way that shifts the focus away from learning and toward fixing.

John Carroll (original paper)

Thai Wood — Resilience Roundup (summary)

My latest adventures in (negligently) running sreweekly.com. It started with a surprise AWS bill, and then it got kinda weird…

Lex Neva

Deep technical details on a series of recent incidents involving Basecamp.

Troy Toman — Basecamp

Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.

Charity Majors

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Keith Ballinger — GitHub

Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.

Matthew Prince — Cloudflare

This is ThousandEyes’ analysis of the outage; it reaches similar conclusions to Cloudflare’s and includes a lot more detail.

Angelique Medina and Archana Kesavan — ThousandEyes

Outages

SRE Weekly Issue #233

A message from our sponsor, StackHawk:

Did you catch the GitLab Commit keynote by StackHawk Founder Joni Klippert? View it on demand now to learn how security got left behind, and what can be done to improve it.
https://about.gitlab.com/events/commit/

Articles

In this post, I’ll share how we ensured that Meet’s available service capacity was ahead of its 30x COVID-19 usage growth, and how we made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.

Samantha Schaevitz — Google

I love the concept of “battleshorts” just as much as I’ve been enjoying this series of articles analyzing STAMP.

Lorin Hochstein

Honeycomb had 5 incidents in just over a week, prompting not only their normal incident investigation process, but a meta-analysis of all five together.

Emily Nakashima — Honeycomb

Why is Chromium responsible for half of the DNS queries to the root nameservers? And why do they all return NXDOMAIN?

Matthew Thomas — APNIC
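
Some context on the mechanism, as the article explains it: Chromium ships a hijacking detector that resolves a few randomly generated hostnames, and if any of them unexpectedly resolve, the local resolver is rewriting NXDOMAIN answers. The sketch below illustrates that probing idea in Python; it is not Chromium’s actual code, just a minimal illustration of the mechanism.

    # Minimal sketch of a Chromium-style NXDOMAIN-hijacking probe (not
    # Chromium's actual code): resolve a few random single-label hostnames
    # and see whether any of them unexpectedly resolve.
    import random
    import socket
    import string

    def random_label() -> str:
        # Chromium's probes use random labels of 7 to 15 characters.
        length = random.randint(7, 15)
        return "".join(random.choices(string.ascii_lowercase, k=length))

    def resolver_hijacks_nxdomain(probes: int = 3) -> bool:
        """Return True if a random (almost certainly nonexistent) name
        resolves, suggesting the resolver rewrites NXDOMAIN responses."""
        for _ in range(probes):
            try:
                socket.gethostbyname(random_label())
                return True   # A bogus name resolved: NXDOMAIN hijacking.
            except socket.gaierror:
                continue      # Expected: the name does not exist.
        return False

    if __name__ == "__main__":
        print("NXDOMAIN hijacking suspected:", resolver_hijacks_nxdomain())

Multiplied across every Chromium start-up and network change, probes like these for names that nothing answers locally end up at the root servers, which can only reply NXDOMAIN; that is the traffic the article measures.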

“That Moment” when your fire suppression system triggers and the fire department shows up. This is part war story and part description of incident response practices.

Ariel Pisetzky — Taboola

An overload in an internal blob storage system impacted many dependent services.

Google

Sharding as a service: now there’s an interesting idea.

Gerald Guo and Thawan Kooburat — Facebook

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

Emily Arnott — Blameless
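
The “automated SRE” comparison rests on the controller pattern that Operators implement: watch the application’s actual state, compare it with the declared desired state, and act to converge the two. Below is a conceptual Python sketch of that loop; the cluster object and its methods are hypothetical stand-ins, not a real Kubernetes client or Operator framework.

    # Illustrative reconcile loop in the style of a Kubernetes Operator.
    # Conceptual sketch only: `cluster` and the resource shapes are
    # hypothetical stand-ins, not a real Kubernetes API.
    import time

    def reconcile(cluster, desired: dict) -> None:
        """Compare desired state with observed state and converge the two."""
        observed = cluster.get_deployment(desired["name"])
        if observed is None:
            cluster.create_deployment(desired)              # first install
        elif observed["replicas"] != desired["replicas"]:
            cluster.scale(desired["name"], desired["replicas"])
        elif observed["version"] != desired["version"]:
            cluster.rolling_update(desired["name"], desired["version"])
        # Otherwise the actual state already matches; nothing to do.

    def run(cluster, desired: dict, interval_seconds: float = 30.0) -> None:
        # Real Operators react to watch events; a timer keeps the sketch simple.
        while True:
            reconcile(cluster, desired)
            time.sleep(interval_seconds)

An Operator encodes the runbook steps a human would otherwise perform (install, scale, upgrade, recover) into that loop, which is what makes the “automated SRE” analogy at least partly apt.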

Outages

SRE Weekly Issue #232

A message from our sponsor, StackHawk:

Is your company adopting GraphQL? Adding security testing is simple. Watch this 20-minute walkthrough to see how easy it is to get up and running!

https://www.youtube.com/watch?v=--liu7LCs5A

Articles

An engineer’s observation of a really effective Incident Command pattern.

Dean Wilson

Here’s Lorin Hochstein’s take on the STAMP (Systems-Theoretic Accident Model and Processes) workshop he attended recently.

Lorin Hochstein

What’s the difference between Resilience Engineering and High Reliability Organizations? This paper (and excellent summary) explains.

Torgeir Haavik, Stian Antonsen, Ragnar Rosness, and Andrew Hale (original paper)

Thai Wood — Resilience Roundup (summary)

This one focuses on what I feel are really important parts of SRE, taken from the article’s subheadings:

  • Vendor engineering
  • Product engineering
  • Sociotechnical systems engineering
  • Managing the portfolio of technical investments

Charity Majors — Honeycomb

Now that’s a for-serious incident report. Nice one, folks! This is an interesting case of theory-meets-reality for disaster planning.

giles — PythonAnywhere

Outages

SRE Weekly Issue #231

I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.

A message from our sponsor, StackHawk:

Learn about StackHawk’s setup of Prometheus Metrics with SpringBoot & GRPC Services.
https://www.stackhawk.com/blog/prometheus-metrics-with-springboot-and-grpc-services?utm_source=SREWeekly

Articles

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

A routine infrastructure maintenance operation had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — Red Hat

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

Outages

SRE Weekly Issue #230

Happy BTW: Wear a mask.

A message from our sponsor, StackHawk:

Add security testing to your CI pipelines with GitHub Actions. Check out this webinar recording (no email required) to learn how.
https://www.youtube.com/watch?v=W_7BxFgMYHs&time_continue=8

Articles

LaunchDarkly started off with a polling-based architecture and ultimately migrated to pushing deltas out to clients.

Dawn Parzych — LaunchDarkly
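
To make the polling-versus-push distinction concrete, here is a minimal sketch; the function names and payload shapes are invented for illustration and are not LaunchDarkly’s actual API. The polling client re-downloads the whole flag configuration on a timer, while the streaming client starts from one snapshot and then applies only the deltas the server pushes.

    # Hypothetical illustration of the two flag-delivery models; not
    # LaunchDarkly's actual API or payloads.
    import time
    from typing import Callable, Dict, Iterable

    FlagConfig = Dict[str, bool]

    def run_polling_client(fetch_all: Callable[[], FlagConfig],
                           interval_seconds: float, cycles: int) -> FlagConfig:
        """Polling: re-fetch the entire configuration on a timer,
        even when nothing has changed."""
        flags: FlagConfig = {}
        for _ in range(cycles):
            flags = fetch_all()          # full payload every time
            time.sleep(interval_seconds)
        return flags

    def run_streaming_client(snapshot: FlagConfig,
                             deltas: Iterable[FlagConfig]) -> FlagConfig:
        """Push: start from one full snapshot, then apply only the changes
        the server sends (for example over a server-sent-events stream)."""
        flags = dict(snapshot)
        for delta in deltas:             # each event carries only what changed
            flags.update(delta)
        return flags

    if __name__ == "__main__":
        snapshot = {"new-checkout": False, "dark-mode": True}
        changes = [{"new-checkout": True}, {"beta-search": True}]
        print(run_polling_client(lambda: snapshot, 0.0, cycles=3))
        print(run_streaming_client(snapshot, changes))

Pushing deltas costs some connection management, but it cuts update latency and avoids repeatedly shipping configuration that hasn’t changed, which is broadly the motivation the article describes.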

A brief overview of some problems with distributed tracing, along with a suggestion of another way involving AI.

Larry Lancaster — Zebrium

This is Google’s post-incident report for the Google Classroom incident on July 7.

Google

Uber has long been a champion of microservices. Now, with several years of experience, they share the lessons they’ve learned and how they deal with some of the pitfalls.

Adam Gluck — Uber

This article opens with an interesting description of what the Cloudflare outage looked like from PagerDuty’s perspective.

Dave Bresci — PagerDuty

This post reflects on two distinct philosophies of safety:

  • the engineering design should ensure that the system is safe
  • design alone cannot ensure that the system is safe

Lorin Hochstein

You can’t use availability metrics to tell you whether your system is reliable enough, because they can only tell you when you have a problem.

Lorin Hochstein

Outages
