SRE WEEKLY – Page 39 – scalability, availability, incident response, automation

SRE Weekly Issue #321

lex

May 8, 2022

Articles

Using Fault Injection Testing to Improve DoorDash Reliability

A researcher explains how they implemented their microservice failure testing tool at DoorDash. The tool, Filibuster, automatically discovers microservice dependencies and injects faults, avoiding the need to design specific individual failure scenarios.

Christopher Meiklejohn — DoorDash

Twitter: @ReinH on Atlassian’s incident write-up

Last week, I shared Atlassian’s outage write-up. This link is a Twitter thread with a critique.

I feel like it is perhaps not a “good look” to repeatedly try to sell your product in your writeup about your product’s catastrophic outage

@ReinH

usefulness of error

“Error” serves a number of functions for an organization: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.

Eric Dobbs

Incident Report for Enom (January 15, 2022)

This is a write-up posted in January for an incident that occurred during an infrastructure migration. I feel like I can relate to every one of the learnings.

Enom (Tucows)

On-Call: Leave It Better Than You Found It

In the past two years, I’ve been participating in on-call rotations as a Site Reliability Engineer at Vinted. Here are some of the practical lessons I’ve learned about the process.

Ernestas Narmontas

How SREs analyze risks to evaluate SLOs

This article is all about finding out what risks exist that may impact your ability to meet your SLOs. Once you’ve done that, you can determine whether your SLOs are realistic.

Ayelet Sachto — Google
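
As a rough illustration of the kind of arithmetic this involves, the sketch below compares the expected "bad minutes" from a list of risks against the error budget implied by an SLO. The risk model (incidents per year, time to detect, time to resolve, fraction of users affected), the names, and all the numbers are assumptions made up for this example, not figures taken from the article.

```python
from dataclasses import dataclass
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


@dataclass
class Risk:
    """A hypothetical risk entry; every field here is an illustrative assumption."""
    name: str
    incidents_per_year: float
    minutes_to_detect: float
    minutes_to_resolve: float
    fraction_affected: float  # 0.0 to 1.0

    def bad_minutes_per_year(self) -> float:
        # Expected yearly impact: frequency x duration x share of users affected.
        duration = self.minutes_to_detect + self.minutes_to_resolve
        return self.incidents_per_year * duration * self.fraction_affected


def error_budget_minutes(slo: float) -> float:
    """Minutes per year that may be 'bad' while still meeting the SLO."""
    return (1.0 - slo) * MINUTES_PER_YEAR


def slo_is_realistic(slo: float, risks: Iterable[Risk]) -> bool:
    """True if the summed expected bad minutes fit inside the error budget."""
    expected = sum(r.bad_minutes_per_year() for r in risks)
    return expected <= error_budget_minutes(slo)


# Made-up example risks.
risks = [
    Risk("bad deploy", incidents_per_year=4, minutes_to_detect=30,
         minutes_to_resolve=60, fraction_affected=0.5),
    Risk("database failover", incidents_per_year=1, minutes_to_detect=10,
         minutes_to_resolve=120, fraction_affected=1.0),
]

for target in (0.999, 0.9999):
    print(target, slo_is_realistic(target, risks))
```

With these invented inputs, the same risk list fits inside a 99.9% target but not a 99.99% one, which is the sort of conclusion a risk analysis is meant to surface.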

How we aligned 200 teams to monitor services with SLOs

When your organization chooses to implement SLOs, how do you get everyone on board? This two-part series has an in-depth look at how Klarna did it.

Andrew Cartine — Klarna

What is an SRE Product Manager?

Subtitle: And why do SRE teams need PMs?

After laying out the reasons why SREs need PMs, this article goes into detail about what a PM can bring to an SRE team.

António Araújo — detech.ai

BellJar: A new framework for testing system recoverability at scale

BellJar helps users find cyclic dependencies in their services by running totally isolated VMs and requiring users to explicitly enable every external dependency they need in order to bootstrap each service. It also has a really neat feature: it automatically generates runbooks based on these test cases.

Christopher Bunn and Jie Huang — Meta

Meltdown: Three Mile Island

This week, I watched Netflix’s Meltdown: Three Mile Island, a documentary about the nuclear accident in the US in 1979. It’s not exactly a post-incident write-up, but there’s a lot in there about normalization of deviance, situational awareness, and risk-taking (both in and out of incidents).

Netflix

Outages

Slack
- and this one
Heroku
- Heroku’s been dealing with a security incident since April 13. They performed a mass password reset of all accounts and their GitHub integration has been disabled for days.
Roblox

SRE Weekly Issue #320

lex

May 1, 2022

Articles

Slack’s Incident on 2-22-22

Slack shared this write-up of their February outage, which involved complex systems interactions and cascading failure.

Laura Nolan — Slack

The Repeat Incident Fallacy

Go watch this lightning talk now! She had me hooked within the first ten seconds.

Hi, my name is Emily Ruppe, I work at Jeli.io, and I am a recovering incident commander, and I am sick of the phrase “to prevent this incident from ever happening again”.

Emily Ruppe — DevOpsDays Rockies

Founding Uber SRE.

This is my personal story of starting the SRE organization at Uber.

This article was written by a former Uber employee and is posted on their personal blog.

Will Larson

Post-Incident Review on the Atlassian April 2022 outage

This is total transparency at its finest. This write-up has all the details you could ever hope for on what went wrong, how they responded, and what comes next.

Sri Viswanath — Atlassian

Site Reliability Engineering Glossary

The target audience is new SREs and executive sponsors who would keep hearing these terms repeatedly but not take the time to read 1000s of words at a time.

[source: author comment on Reddit]

Ash P. — SREPath

That time we unplugged a data center to test our disaster readiness

Dropbox wanted to be able to handle datacenter failure. To reach this goal, they moved from an active/active model to active/passive and spun up a new Disaster Readiness team to rework their failover system.

Krishelle Hardson-Hurley, Ross Delinger, and Tong Pham — Dropbox

SLOs for everyone with Sloth

HelloFresh drove the implementation of SLOs in their Kubernetes-based infrastructure using Prometheus and Sloth.

Chris Loukas — HelloFresh

Delivering Large-Scale Platform Reliability

A Roblox engineer outlines the way that Roblox handles reliability at scale.

Alberto Covarrubias — Roblox

Your On Call Rotation is Harmful (And Here’s How to Make it Better)

[…] let’s look at some common on call antipatterns and some simple things we can do to alleviate their common pitfalls.

Nickolas Means — Sym

Outages

SRE Weekly Issue #319

lex

April 24, 2022

Articles

Incident Response Isn’t Enough

Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.

Marc Brooker

ZEN and the art of Reliability

Zendesk SRE has a set of 8 reliability principles that guide what they do.

Jason Smale — Zendesk

Incident management best practices: before the incident

We’re going to talk about a few necessities that enable exceptional incident management.

Service ownership

Incident roles

The incident declaration process

Running incident drills

Robert Ross — FireHydrant

A Foolish Consistency: Consul at Fly.io

I don’t think you’re supposed to use Consul that way…

Read this article to follow along on an interesting design journey.

Thomas Ptacek — Fly.io

Slight Reliability Episode 6 – Afailability

One single metric for availability probably can’t tell you the whole story.

Stephen Townshend — Slight Reliability

Making operational work more visible

We can learn from the process another engineer takes to debug a problem. But often, a ticket or problem description is stripped of the process and just has the answer, hampering learning.

Lorin Hochstein — The ReadME Project (GitHub)

The Merpay SRE Team: Past and future

We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.

@tjun — Mercari

Incident Analysis 101: Techniques for Sharing Incident Findings

This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.

Vanessa Huerta Granda — Jeli

Gyros and Gimbals, oh my! — The James Webb Space Telescope

There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals — and in the way that the JWST doesn’t.

Robert Barron — IBM

Outages

Hulu
IRS
- The US Internal Revenue Service’s systems went down on the due date for tax filing.
Instagram

SRE Weekly Issue #318

lex

April 18, 2022

Articles

Errors are constructed, not discovered

This talk summary explores the idea that “error” is a label applied to an event from the outside, rather than a simple fact. What can this tell us about our after-incident investigation process?

Fred Hebert

PIPEFAIL: How a missing shell option slowed Cloudflare down

Here’s a deep dive into a performance degradation in Cloudflare last December that was related to missing error handling in a shell script.

Alex Forster — Cloudflare
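
For readers unfamiliar with the option in the title, here is a small model of what `pipefail` changes, written in Python so the rule is easy to test. It follows bash's documented behaviour for pipeline exit status and is not taken from Cloudflare's script; the function name `pipeline_status` is made up for this sketch.

```python
from typing import Sequence


def pipeline_status(exit_codes: Sequence[int], pipefail: bool = False) -> int:
    """Model the exit status bash reports for a pipeline such as `a | b | c`.

    `exit_codes` holds each command's exit status, left to right.
    """
    if not exit_codes:
        raise ValueError("a pipeline needs at least one command")
    if not pipefail:
        # Default: only the last command's status is reported, so a failure
        # earlier in the pipeline is silently lost.
        return exit_codes[-1]
    # With pipefail: the rightmost non-zero status, or 0 if all succeeded.
    for code in reversed(exit_codes):
        if code != 0:
            return code
    return 0


# `failing_command | gzip`: the first command fails, the second succeeds.
codes = [1, 0]
print(pipeline_status(codes))                 # failure is hidden
print(pipeline_status(codes, pipefail=True))  # failure is reported
```

Without the option, a script that checks only the pipeline's status will carry on as if nothing went wrong, which is the general class of missing error handling the article describes.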

The Scoop: Inside the Longest Atlassian Outage of All Time

Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.

Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).

Gergely Orosz — Pragmatic Engineer

Message durability and quality of service across a large-scale distributed system

Cool trick: their client library can fall back to a backup domain if DNS resolution for ably.io fails.

Jo Stichbury — Ably
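
The general idea of falling back to a second domain can be sketched in a few lines. This is a minimal illustration under assumptions, not Ably's client code: the hostnames are placeholders, and `connect` is a hypothetical callable standing in for whatever the real client does to open a connection.

```python
import socket
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_fallback(hosts: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each host in order and return the first successful connection."""
    last_error: Optional[Exception] = None
    for host in hosts:
        try:
            return connect(host)
        except OSError as err:
            # socket.gaierror (a DNS lookup failure) is a subclass of OSError,
            # as are connection refusals and timeouts.
            last_error = err
    raise ConnectionError(f"all hosts failed: {list(hosts)}") from last_error


# Placeholder hostnames; the primary domain's DNS is simulated as broken.
PRIMARY = "realtime.primary.example"
BACKUP = "realtime.backup.example"


def fake_connect(host: str) -> str:
    if host == PRIMARY:
        raise socket.gaierror("simulated name resolution failure")
    return f"connected to {host}"


print(connect_with_fallback([PRIMARY, BACKUP], fake_connect))
```

Putting the backup on a separate domain matters because a problem with the primary domain's DNS would otherwise take out every hostname under it at once.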

It’s always DNS . . . except when it’s not: A deep dive through gRPC, Kubernetes, and AWS networking

It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.

Laurent Bernaille and David Lentz — Datadog

India’s Inadvertent Missile Launch Underscores the Risk of Accidental Nuclear Warfare

Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.

Zia Mian, M. V. Ramana — Scientific American

The Pros and Cons of Embedded SREs

Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

YouTube
Insteon
- Insteon is down and may not be coming back
Amazon

SRE Weekly Issue #317

lex

April 10, 2022

Bit of a short issue this week, as I’m currently recovering from COVID-19. Please don’t worry! I seem to have a very minor case, likely thanks in large part to vaccination and masking. I mostly just feel tired.

Articles

In Criminalizing Error, We Are Doomed to Repeat Our Mistakes

This first article about the RaDonda Vaught case gives background and an overview of why prosecuting a nurse for a medication error is a bad idea.

Sending a nurse to prison for causing a patient’s death may satisfy the thirst for vengeance, but it won’t make hospitals any safer.

Jessie Singer — The Nation

The Blame Game Isn’t Very Fun

And this one goes into more detail about Vaught’s case and medical error in general, from the perspective of a doctor.

Rob Poston

GitHub Availability Report: March 2022

GitHub shares more detail about their very rough March.

Jakub Oleksy — GitHub

Incident Analysis 101: Handling Action Items

I formerly advocated that the point of a retrospective was to produce action items. Now, my opinion is more nuanced and along the lines of this article. Action items are important, but we can’t let them get in the way of learning.

Emily Ruppe — Jeli

Taking the hit

I’ve done this before without even meaning to, and looking back on it, it was a great strategy.

When you know that your work will be reviewed by an expert, it’s better to be clear and wrong than vague.

Lorin Hochstein

Outages

Atlassian Cloud
- This affects Jira, Confluence, Statuspage.io, and OpsGenie. The incident has been ongoing for 5 days and counting.
Starlink
