SRE Weekly Issue #322

Bit of a short issue this week. This morning, I stepped on my phone, crushing it mightily beneath my bootheel. Unfortunately, a lot of my automation for reviewing articles is on there… thank goodness I have functioning backups.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

What? Actually, it’s a pretty good analogy.

  Emily Arnott — Blameless

Mercari follows up on their previous article about their embedded SRE team with more details on how their embedding model works.

  Taichi Nakashima — Mercari

Interesting things happen when you combine tail latency with a microservice architecture.

  Marc Brooker
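
As a rough illustration of why this combination matters (a back-of-the-envelope sketch, not necessarily the argument the linked article makes): if a request fans out to several backend calls and each call is independently slow with some small probability, the chance that the whole request hits at least one slow call grows quickly with the fan-out. The function name and the numbers below are illustrative assumptions.

```python
def prob_any_slow(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent calls is slow.

    Assumes every call is slow with the same probability `p_slow` and that
    calls are independent, which real systems only approximate.
    """
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be between 0 and 1")
    if fanout < 0:
        raise ValueError("fanout must be non-negative")
    # P(at least one slow) = 1 - P(all calls fast)
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # Hypothetical case: each call exceeds its p99 latency 1% of the time.
    for n in (1, 10, 100):
        print(f"fan-out {n:>3}: {prob_any_slow(0.01, n):.1%} of requests see a slow call")
```

Under these assumptions, a request that touches 100 services sees at least one p99-slow call roughly 63% of the time, even though each individual service looks healthy.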

Their starting point was paging for every single exception raised by their application. Here’s how they tempered that a bit to get a handle on their paging volume.

  Lisa Karlin Curtis — incident.io

This article draws from the “SRE Hierarchy” in Google’s SRE book (which itself is a reference to Maslow’s hierarchy of needs). It recasts the SRE hierarchy as a path to maturity.

  Ash P. — SREPath

Google posted this summary of an incident from late April. A configuration change had the unintended effect of causing livestream view requests to fail.

  Google

Outages

  • Xbox
    • I don’t normally bother with game outages, but this one caught my eye. During the 4-day outage, customers were unable to play Xbox games that they had already purchased.

  • Twitter
  • Coinbase

SRE Weekly Issue #321

Articles

A researcher explains how they implemented their microservice failure testing tool at DoorDash. The tool, Filibuster, automatically discovers microservice dependencies and injects faults, avoiding the need to design specific individual failure scenarios.

  Christopher Meiklejohn — DoorDash

Last week, I shared Atlassian’s outage write-up. This link is a Twitter thread with a critique.

I feel like it is perhaps not a “good look” to repeatedly try to sell your product in your writeup about your product’s catastrophic outage

  @ReinH

“Error” serves a number of functions for an organization: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.

  Eric Dobbs

This is a write-up posted in January for an incident that occurred during an infrastructure migration. I feel like I can relate to every one of the learnings.

  Enom (Tucows)

In the past two years, I’ve been participating in on-call rotations as a Site Reliability Engineer at Vinted. Here are some of the practical lessons I’ve learned about the process.

  Ernestas Narmontas

This article is all about finding out what risks exist that may impact your ability to meet your SLOs. Once you’ve done that, you can determine whether your SLOs are realistic.

  Ayelet Sachto — Google

When your organization chooses to implement SLOs, how do you get everyone on board? This two-part series has an in-depth look at how Klarna did it.

  Andrew Cartine — Klarna

Subtitle: And why do SRE teams need PMs?

After laying out the reasons why SREs need PMs, this article goes into detail about what a PM can bring to an SRE team.

  António Araújo — detech.ai

BellJar helps users find cyclic dependencies in their services by running totally isolated VMs and requiring users to explicitly enable every external dependency they need in order to bootstrap each service. It has a really neat feature of automatically generating runbooks based on these test cases.

  Christopher Bunn and Jie Huang — Meta

This week, I watched Netflix’s Meltdown: Three Mile Island, a documentary about the nuclear accident in the US in 1979. It’s not exactly a post-incident write-up, but there’s a lot in there about normalization of deviance, situational awareness, and risk-taking (both in and out of incidents).

  Netflix

Outages

  • Slack
  • Heroku
    • Heroku’s been dealing with a security incident since April 13. They performed a mass password reset of all accounts and their GitHub integration has been disabled for days.

  • Roblox

SRE Weekly Issue #320

Articles

Slack shared this write-up of their February outage, which involved complex systems interactions and cascading failure.

  Laura Nolan — Slack

Go watch this lightning talk now! She had me hooked within the first ten seconds.

Hi, my name is Emily Ruppe, I work at Jeli.io, and I am a recovering incident commander, and I am sick of the phrase “to prevent this incident from ever happening again”.

  Emily Ruppe — DevOpsDays Rockies

This is my personal story of starting the SRE organization at Uber.

This article was written by a former Uber employee and is posted on their personal blog.

  Will Larson

This is total transparency at its finest. This write-up has all the details you could ever hope for on what went wrong, how they responded, and what comes next.

  Sri Viswanath — Atlassian

The target audience is new SREs and executive sponsors who would keep hearing these terms repeatedly but not take the time to read 1000s of words at a time.

[source: author comment on Reddit]

  Ash P. — SREPath

Dropbox wanted to be able to handle datacenter failure. To reach this goal, they moved from an active/active model to active/passive and spun up a new Disaster Readiness team to rework their failover system.

  Krishelle Hardson-Hurley, Ross Delinger, and Tong Pham — Dropbox

HelloFresh drove the implementation of SLOs in their Kubernetes-based infrastructure using Prometheus and Sloth.

  Chris Loukas — HelloFresh

A Roblox engineer outlines the way that Roblox handles reliability at scale.

  Alberto Covarrubias — Roblox

[…] let’s look at some common on call antipatterns and some simple things we can do to alleviate their common pitfalls.

  Nickolas Means — Sym

Outages

SRE Weekly Issue #319

Articles

Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.

  Marc Brooker

Zendesk SRE has a set of 8 reliability principles that guide what they do.

  Jason Smale — Zendesk

We’re going to talk about a few necessities that enable exceptional incident management.

  1. Service ownership
  2. Incident roles
  3. The incident declaration process
  4. Running incident drills

  Robert Ross — FireHydrant

I don’t think you’re supposed to use Consul that way…

Read this article to follow along on an interesting design journey.

  Thomas Ptacek — Fly.io

One single metric for availability probably can’t tell you the whole story.

  Stephen Townshend — Slight Reliability

We can learn from the process another engineer takes to debug a problem. But often, a ticket or problem description is stripped of the process and just has the answer, hampering learning.

  Lorin Hochstein — The ReadME Project (GitHub)

We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.

  @tjun — Mercari

This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.

  Vanessa Huerta Granda — Jeli

There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals — and in the way that the JWST doesn’t.

  Robert Barron — IBM

Outages

  • Hulu
  • IRS
    • The US Internal Revenue Service’s systems went down on the due date for tax filing.

  • Instagram

SRE Weekly Issue #318

Articles

This talk summary explores the idea that “error” is a concept applied to an event from the outside, rather than a simple fact. What can this tell us about our after-incident investigation process?

  Fred Hebert

Here’s a deep dive into a performance degradation in Cloudflare last December that was related to missing error handling in a shell script.

  Alex Forster — Cloudflare

Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.

Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).

  Gergely Orosz — Pragmatic Engineer

Cool trick: their client library can fall back to a backup domain if DNS for ably.io fails.

  Jo Stichbury — Ably
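
The fallback idea can be sketched roughly as follows. This is a minimal illustration and not Ably's client library: the hostnames, the function name, and the injectable `resolve` parameter are all hypothetical. It tries each hostname in order and returns the first one that resolves.

```python
import socket
from typing import Callable, Iterable, Optional


def first_resolvable_host(
    hosts: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Optional[str]:
    """Return the first hostname that resolves, or None if none do.

    `resolve` is injectable so the logic can be exercised without real DNS.
    socket.gethostbyname raises socket.gaierror (a subclass of OSError)
    when a lookup fails.
    """
    for host in hosts:
        try:
            resolve(host)
        except OSError:
            continue  # lookup failed, try the next candidate
        return host
    return None


if __name__ == "__main__":
    # Hypothetical primary and backup endpoints. For the fallback to help,
    # the backup would sit under a different domain than the primary.
    candidates = ["realtime.primary.example", "realtime.backup.example"]
    print(first_resolvable_host(candidates))
```

A real client would also need to handle the case where DNS resolves but the connection still fails; this sketch covers only the lookup step.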

It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.

  Laurent Bernaille and David Lentz — Datadog

Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.

  Zia Mian, M. V. Ramana — Scientific American

Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

A production of Tinker Tinker Tinker, LLC