General

SRE Weekly Issue #233

A message from our sponsor, StackHawk:

Did you catch the GitLab Commit keynote by StackHawk Founder Joni Klippert? View it on demand now to learn how security got left behind, and what can be done to improve.
https://about.gitlab.com/events/commit/

Articles

In this post, I’ll share how we ensured that Meet’s available service capacity was ahead of its 30x COVID-19 usage growth, and how we made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.

Samantha Schaevitz — Google

I love the concept of “battleshorts” just as much as I’ve been enjoying this series of articles analyzing STAMP.

Lorin Hochstein

Honeycomb had 5 incidents in just over a week, prompting not only their normal incident investigation process, but also a meta-analysis of all five together.

Emily Nakashima — Honeycomb

Why is Chromium responsible for half of the DNS queries to the root nameservers? And why do they all return NXDOMAIN?

Matthew Thomas — APNIC

“That Moment” when your fire suppression system triggers and the fire department shows up. This is part war story and part description of incident response practices.

Ariel Pisetzky — Taboola

An overload in an internal blob storage system impacted many dependent services.

Google

Sharding as a service: now there's an interesting idea.

Gerald Guo, Thawan Kooburat — Facebook

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

Emily Arnot — Blameless

Outages

SRE Weekly Issue #232

A message from our sponsor, StackHawk:

Is your company adopting GraphQL? Adding security testing is simple. Watch this 20-minute walkthrough to see how easy it is to get up and running!

https://www.youtube.com/watch?v=--liu7LCs5A

Articles

An engineer’s observation of a really effective Incident Command pattern.

Dean Wilson

Here’s Lorin Hochstein’s take on the STAMP (Systems-Theoretic Accident Model and Processes) workshop he attended recently.

Lorin Hochstein

What’s the difference between Resilience Engineering and High Reliability Organizations? This paper (and excellent summary) explains.

Torgeir Haavik, Stian Antonsen, Ragnar Rosness, and Andrew Hale (original paper)

Thai Wood — Resilience Roundup (summary)

This one focuses on what I feel are really important parts of SRE, taken from the article’s subheadings:

  • Vendor engineering
  • Product engineering
  • Sociotechnical systems engineering
  • Managing the portfolio of technical investments

Charity Majors — Honeycomb

Now that’s a for-serious incident report. Nice one, folks! This is an interesting case of theory-meets-reality for disaster planning.

giles — PythonAnywhere

Outages

SRE Weekly Issue #231

I have a special treat for you this week: 7 detailed incident reports! Just a note, I’ll be on vacation next week, so I’ll see you in two weeks on August 23.

A message from our sponsor, StackHawk:

Learn about StackHawk’s setup of Prometheus Metrics with SpringBoot & GRPC Services.
https://www.stackhawk.com/blog/prometheus-metrics-with-springboot-and-grpc-services?utm_source=SREWeekly

Articles

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

Routine infrastructure maintenance had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — Red Hat

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

Outages

SRE Weekly Issue #230

Happy BTW: Wear a mask.

A message from our sponsor, StackHawk:

Add security testing to your CI pipelines with GitHub Actions. Check out this webinar recording (no email required) to learn how.
https://www.youtube.com/watch?v=W_7BxFgMYHs&time_continue=8

Articles

LaunchDarkly started off with a polling-based architecture and ultimately migrated to pushing deltas out to clients.

Dawn Parzych — LaunchDarkly

A brief overview of some problems with distributed tracing, along with a suggestion of another way involving AI.

Larry Lancaster — Zebrium

This is Google’s post-incident report for their Google Classroom incident on July 7.

Uber has long been a champion of microservices. Now, with several years of experience, they share the lessons they’ve learned and how they deal with some of the pitfalls.

Adam Gluck — Uber

This article opens with an interesting description of what the Cloudflare outage looked like from PagerDuty’s perspective.

Dave Bresci — PagerDuty

This post reflects on two distinct philosophies of safety:

  • the engineering design should ensure that the system is safe
  • design alone cannot ensure that the system is safe

Lorin Hochstein

You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem.

Lorin Hochstein

Outages

SRE Weekly Issue #229

A message from our sponsor, StackHawk:

Read about how to build test driven security with StackHawk + Travis CI + Docker Compose.
https://www.stackhawk.com/blog/test-driven-security-with-travis-ci-and-docker-compose?utm_source=SREWeekly

Articles

More details have emerged about the Twitter break-in last week, leading some to chalk it up to “stupidity”. Here’s a take on how to see it as not being about “stupidity”.

Lorin Hochstein

The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have a framework and a set of tools to deal with it.

Paul Hammond and Samantha Stoller — Slack

I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.

Thai Wood — Learning from Incidents

[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.

Sanjeev Sharma

Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).

Charity Majors

This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at https://srefromhome.com/. I love how they refer to the pandemic as a months-long incident, and point out that if you’re always in an incident then you’re never in an incident.

Julie Gunderson and Mandi Walls — Page it to the Limit

I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.

Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn

Outages

A production of Tinker Tinker Tinker, LLC