General

SRE Weekly Issue #281

lex

August 1, 2021

Articles

The incident: a Formula 1 car hit the side barrier just over 20 minutes before the race was due to start. The team sprang into action with an incredibly calm, orderly and speedy incident response to replace the damaged parts faster than they ever had before.

This article is a great analysis, and there’s also an excellent 8-minute video that I highly recommend. Listen to the way the sporting director and everyone else communicates so calmly. It’s a rare treat to get video footage of a production incident like this.

Chris Evans — incident.io

Observe a Service; Not a Server

The underlying components become the cattle, and the services become the new Pet that you tend to with your utmost care.

Piyush Verma — Last9

aws-samples/aws-incident-response-playbooks

AWS posted these example/template incident response playbooks for customers to use in their incident response process.

AWS

(All) DNS Resource Records

A list with descriptions of all DNS record types, even the obscure ones. Tag yourself, I’m HIP.

Jan Schaumann

What’s a Major Incident Anyway?

This one includes a useful set of questions to prompt you as you develop your incident response and classification process.

Hollie Whitehead — xMatters

How to be better, together

The author of this article shows us how they communicate actively, perform incident retrospectives, and even discuss “near misses” and normal work in order to better learn how their system works — all skills that apply directly to SRE.

Jason Koppe — Learning From Incidents

The Unique Reliability Engineering Requirements of Microservices

Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.

JJ Tang — Rootly

It’s Time to Rethink Outage Reports

This one uses Akamai’s incident report from their July 22 major outage as a jumping-off point to discuss openness in incident reports. The text of Akamai’s incident report is included in full.

Geoff Huston — CircleID

Culture & Conduct Risk: The Normalization of Deviance

Drawing from the “normalization of deviance” concept introduced in the Challenger disaster study [Diane Vaughan], this article explores the idea of studying your organization’s culture to catch problems early, rather than waiting to respond after they happen.

Stephen Scott

Lorin Hochstein (Netflix) [StaffEng Podcast]

This episode of the StaffEng Podcast is an interview with Lorin Hochstein, whose writings I’ve featured here numerous times. My favorite part of this episode is when they talk about doing incident analysis for near misses. One of the hosts points out that it’s much easier for folks to talk about what happened, because there was no incident so they’re not worried about being blamed.

David Noël-Romas and Alex Kessinger — StaffEng Podcast

Outages

Let’s Encrypt
Snapchat
Wikipedia
- To fact-check this one, I looked at their Grafana dashboard. Neat!
Netflix
Venmo
Blackboard Learn
eBay
reddit

SRE Weekly Issue #280

lex

July 25, 2021

Articles

The Harmful Consequences of the Robustness Principle

The Robustness Principle (“be conservative in what you send, and liberal in what you accept”) has its uses, but it may not be best for the development of mature protocols, according to this IETF draft.

Martin Thomson

No, we don’t use Kubernetes

Does Docker without Kubernetes make sense? These folks have a well-reasoned argument explaining why Kubernetes is not for them.

Maik Zumstrull — Ably

Personal data breach reporting for service outages (such as when your CDN is down)

Can a service outage unrelated to security count as a “personal data breach” in terms of GDPR and other regulations? If you agree with this article’s logic, then maybe it can.

Neil Brown

When You Do DevSecOps, Don’t Forget the SREs

The interactions between security and reliability incidents can be complex and hard to navigate. The example scenarios in this article really made me think.

Quentin Rousseau — Rootly

Solving the Three Stooges Problem

To deal with thundering herds, reddit implements caching in front of each of its microservices.

Raj Shah — reddit
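One common way to keep a thundering herd away from a backend is request coalescing: when many callers miss the cache for the same key at once, only the first one does the expensive load and the rest wait for its result. The sketch below illustrates that general technique only. The blurb above does not say how reddit's caching layer is built, and `CoalescingCache` and `loader` are hypothetical names made up for this example.

```python
import threading
from concurrent.futures import Future


class CoalescingCache:
    """Hypothetical cache that runs at most one load per key at a time."""

    def __init__(self, loader):
        self._loader = loader      # assumed: the expensive backend call
        self._lock = threading.Lock()
        self._values = {}          # key -> cached value
        self._in_flight = {}       # key -> Future for a load in progress

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                # First caller for this key: register the pending load.
                future = Future()
                self._in_flight[key] = future

        if not leader:
            # Someone else is already loading this key; wait for them.
            return future.result()

        try:
            value = self._loader(key)
        except Exception as exc:
            with self._lock:
                del self._in_flight[key]
            future.set_exception(exc)  # waiters see the same failure
            raise

        with self._lock:
            self._values[key] = value
            del self._in_flight[key]
        future.set_result(value)
        return value
```

With this shape, ten simultaneous requests for the same uncached key cause one backend call rather than ten. A real deployment would also need expiry and size limits, which are left out here.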

What’s allowed to count as a cause?

Incident causes are a social construct, and it may be that your organizational structure prevents something from being counted as a cause.

Lorin Hochstein

IC1 Reliability Engineer – Dropbox Engineering Career Framework

Check it out: Dropbox publicly released their SRE career ladder.

Dropbox

Incidents, Response, and the People With Tim Nicholas

There’s a moment halfway through this episode of Page It to the Limit where they talk about blamelessness. If you just tell people to “do blameless postmortems”, but you don’t tell them how, then they’ll be afraid to talk about anything people did, and that will hamper learning.

Julie Gunderson, with guest Tim Nicholas — Page It to the Limit

Migrating Facebook to MySQL 8.0

This was a monumental task, considering the 1000+(!!) internal code patches they had to port from MySQL 5.6 to 8.0.

Herman Lee, Pradeep Nayak — Facebook

Outages

Akamai
- Akamai had what they’re calling an “Edge DNS Service Incident”. It made headlines this week because it took down many of their customers, similar to the Akamai incident last month.
Let’s Encrypt
Disney park-related apps
Heroku

SRE Weekly Issue #279

lex

July 18, 2021

Articles

Managing the Risk of Cascading Failure

This is a presentation by Laura Nolan (with text transcript) all about cascading failure, what causes it, how to avoid it, and how to deal with it when it happens.

I love how succinct this is:

[…] in any system where we design to fail over, so any mechanism at all that redistributes load from a failed component to still working components, we create the potential for a cascading failure to happen.

Laura Nolan — Slack (presented at InfoQ)
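The quoted point about failover can be shown with a little arithmetic. This is a toy model, not something taken from the presentation: it assumes identical nodes, load spread evenly across whatever is still alive, and a node that fails as soon as its share exceeds its capacity. The function name and parameters are made up for illustration.

```python
def surviving_nodes(total_load: float, nodes: int, capacity: float) -> int:
    """Fail one node, then redistribute load until the system is stable.

    Returns how many nodes are still alive at the end (0 means the
    failure cascaded through the whole pool).
    """
    alive = nodes - 1  # the initial, unrelated failure
    while alive > 0 and total_load / alive > capacity:
        # The remaining nodes are overloaded, so another one drops out
        # and its share is pushed onto the survivors.
        alive -= 1
    return alive


# 10 nodes with capacity 100 each.
print(surviving_nodes(900, 10, 100))  # 9 survivors carry 100 each
print(surviving_nodes(950, 10, 100))  # about 105.6 each: the pool collapses
```

In this model, a pool running at 90% utilization absorbs the loss of one node, while the same pool at 95% loses every node after a single failure.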

The greedy exec trap

It’s so easy to explain an incident by describing how management could have prevented it by investing additional resources.

Lorin goes on to explain the “trap” part: it’s easy to stop investigating an incident too soon and declare the cause “greedy executives”, preventing us from learning more.

Lorin Hochstein

r/WallStreetBets Incident Anthology (What Worked Edition): Recently Consumed

They redesigned one of their caching systems in 2020, and it paid off handsomely during the GameStop saga. This article discusses the redesign and considers what would have happened without it.

Garrett Hoffman — Reddit

Pragmatic Incident Response: 3 Lessons Learned from Failures

The lessons are:

1. Do retrospectives for small incidents first.
2. Do a retrospective soon after the incident.
3. Alert on the user experience.

All great advice, and #1 is an interesting idea I hadn’t heard before.

Robert Ross — FireHydrant

De-Siloing Incident Management: How to Make Reliability Engineering Everyone’s Job

We can’t engineer reliability in a vacuum. This is a great explainer on how SRE siloing happens, the problems it causes, and how to break SRE out of its shell.

JJ Tang — Rootly

CALLBACK 498, July 2021 – Aircrew Resilience

This ASRS (Aviation Safety Reporting System) Callback issue has some real-world examples of resilient systems in action.

NASA ASRS

Automatic Remediation of Kubernetes Nodes

Facing common Kubernetes node failure modes, Cloudflare uses open source tools (one published by them) to perform automatic restarts.

In the past 30 days, we’ve used the above automatic node remediation process to action 571 nodes. That has saved our humans a considerable amount of time.

Andrew DeMaria — Cloudflare

Outages

SRE Weekly Issue #278

lex

July 11, 2021

Articles

That Sinking Feeling (The #HugOps Song)

Whoa. This is the best thing ever. I feel like I want to make this the official theme song of SRE Weekly.

Forrest Brazeal

r/WallStreetBets Incident Anthology (What Worked Edition): Autoscaler

Their auto-scaling algorithm needed a tweak. Before: scale up by N instances. After: scale up by an amount proportional to the current number of instances.

Fran Garcia — Reddit
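The before-and-after described above can be sketched in a few lines. The step size, the fraction, and the function names are assumptions chosen for illustration, not values from the article.

```python
import math


def scale_up_fixed(current: int, step: int = 2) -> int:
    """Before: always add the same number of instances."""
    return current + step


def scale_up_proportional(current: int, fraction: float = 0.25,
                          min_step: int = 1) -> int:
    """After: add a share of the current fleet, and at least min_step."""
    return current + max(min_step, math.ceil(current * fraction))


for fleet in (4, 100, 1000):
    print(fleet, scale_up_fixed(fleet), scale_up_proportional(fleet))
```

With these made-up numbers, a fixed step adds 2 instances whether the fleet has 4 or 1000, while the proportional rule adds 1, 25, and 250, so the added capacity keeps pace with the size of the service.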

The Incident Review: 4 Incidents in Outer Space

Here’s a look at incidents and reliability challenges that have occurred in outer space, and what SREs stand to learn from them.

JJ Tang — Rootly

Prepare for overnight success — with the right load testing approach

This one includes 3 key things to remember while load testing. My favorite: test the whole system, not just parts.

Cortex

4 ways to improve your influence as an SRE

SRE is as much about building consensus and earning buy-in as it is about actual engineering.

Cortex

NoOps: What Does the Future Hold for DevOps Engineers?

The definition of NoOps in this article is clearer than others I’ve seen. It’s not about firing your operations team, because their skill set is still necessary.

Kentaro Wakayama

Systems Observability

Even though I know what observability is, I got a lot out of this article. It has some excellent examples of questions that are hard to answer with traditional dashboards, and includes my new favorite term:

The industrial term for this problem is Watermelon Metrics; A situation where individual dashboards look green, but the overall performance is broken and red inside.

Nishant Modak and Piyush Verma — Last9

Controlling a process we don’t understand

Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand.

Lorin Hochstein

Troubleshooting: A journey into the unknown

In this epic troubleshooting story, a weird curl bug coupled with Linux memory tuning parameters led to unexpected CPU consumption in an unrelated process.

Pavlos Parissis — Booking.com

How Back Market SREs prepared for Black Friday

Learning a lesson from a rough Black Friday in 2019, these folks used load testing to gather hard data on how they would likely fare in 2020.

Mathieu Garstecki — Back Market

Outages

SRE Weekly Issue #277

lex

July 4, 2021

Articles

FINRA Orders Record Financial Penalties Against Robinhood Financial LLC

Remember all those Robinhood outages? The US financial regulatory agency is making Robinhood repay folks for the losses they sustained as a result and also fining them for other reasons.

Michelle Ong, Ray Pellecchia, Angelita Plemmer Williams, and Andrew DeSouza — FINRA

r/WallStreetBets Incident Anthology: More Data, More Problems

This is brilliant and I wish I’d thought of it years ago:

One of the things we’ve previously seen during database incidents is that a set of impacted tables can provide a unique fingerprint to identify a feature that’s triggering issues.

Courtney Wang — Reddit
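The fingerprint idea in the quote can be pictured as a set comparison: keep a map from each feature to the tables it touches, then rank features by how closely their tables match the ones misbehaving during an incident. The feature names, table names, and scoring rule (Jaccard similarity) below are all invented for illustration and do not describe Reddit's tooling.

```python
# Hypothetical map of features to the database tables they use.
FEATURE_TABLES = {
    "comments": {"comment", "comment_vote"},
    "chat": {"chat_room", "chat_message"},
    "awards": {"award", "comment"},
}


def likely_features(impacted: set[str]) -> list[str]:
    """Rank features by Jaccard similarity to the impacted tables."""
    scores = {}
    for feature, tables in FEATURE_TABLES.items():
        score = len(tables & impacted) / len(tables | impacted)
        if score > 0:
            scores[feature] = score
    return sorted(scores, key=lambda f: scores[f], reverse=True)


print(likely_features({"comment", "comment_vote"}))
```

Here the impacted tables match the "comments" feature exactly, so it ranks first, with "awards" behind it because it shares only one table.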

The Deeper Root Cause of the Fastly and Akamai Outages

The suggested root cause involves consolidation in cloud providers and the importance of DNS.

Alban Kwan — CircleID

Full disclosure: Fastly, my employer, is mentioned.

The normalization of deviance in healthcare delivery

This paper is about recognizing normalization of deviance and techniques for dealing with it. This tidbit really made me think:

[…] they might have been taught a system deviation without realizing that it was so […]

Bus Horiz

Elephant in the Blameless War Room: Accountability

Blameless incident analysis is often at odds with a desire to “hold people accountable”. This article explores that conflict and techniques for managing the needs involved.

Christina Tan and Emily Arnott — Blameless

Shipping on a Spent Error Budget

What can you do if you’re out of error budget but you still want to deliver new features? Get creative.

Paul Osman — Honeycomb

The SRE Incident Response game

I am going to go through the variation we use to upskill our on-call engineers, which we called “The Kobayashi Maru”, a name we borrowed from the Star Trek training exercise to test the character of Starfleet cadets.

Bruce Dominguez
