SRE Weekly Issue #320

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

Slack shared this write-up of their February outage, which involved complex systems interactions and cascading failure.

  Laura Nolan — Slack

Go watch this lightning talk now! She had me hooked within the first ten seconds.

Hi, my name is Emily Ruppe, I work at Jeli.io, and I am a recovering incident commander, and I am sick of the phrase “to prevent this incident from ever happening again”.

  Emily Ruppe — DevOpsDays Rockies

This is my personal story of starting the SRE organization at Uber.

This article was written by a former Uber employee and is posted on their personal blog.

  Will Larson

This is total transparency at its finest. This write-up has all the details you could ever hope for on what went wrong, how they responded, and what comes next.

  Sri Viswanath — Atlassian

The target audience is new SREs and executive sponsors who would keep hearing these terms repeatedly but not take the time to read 1000s of words at a time.

[source: author comment on Reddit]

  Ash P. — SREPath

Dropbox wanted to be able to handle datacenter failure. To reach this goal, they moved from an active/active model to active/passive and spun up a new Disaster Readiness team to rework their failover system.

  Krishelle Hardson-Hurley, Ross Delinger, and Tong Pham — Dropbox

HelloFresh drove the implementation of SLOs in their Kubernetes-based infrastructure using Prometheus and Sloth.

  Chris Loukas — HelloFresh

A Roblox engineer outlines the way that Roblox handles reliability at scale.

  Alberto Covarrubias — Roblox

[…] let’s look at some common on call antipatterns and some simple things we can do to alleviate their common pitfalls.

  Nickolas Means — Sym

Outages

SRE Weekly Issue #319

Articles

Be judicious when you generate remediation tasks from incidents, or you can end up investing in the wrong area.

  Marc Brooker

Zendesk SRE has a set of 8 reliability principles that guide what they do.

  Jason Smale — Zendesk

We’re going to talk about a few necessities that enable exceptional incident management.

  1. Service ownership
  2. Incident roles
  3. The incident declaration process
  4. Running incident drills

  Robert Ross — FireHydrant

I don’t think you’re supposed to use Consul that way…

Read this article to follow along on an interesting design journey.

  Thomas Ptacek — Fly.io

One single metric for availability probably can’t tell you the whole story.

  Stephen Townshend — Slight Reliability

We can learn from the process another engineer takes to debug a problem. But often, a ticket or problem description is stripped of the process and just has the answer, hampering learning.

  Lorin Hochstein — The ReadME Project (GitHub)

We’re still not 100% there as a team, but I hope this article will serve as a reference for anyone who might create an SRE team in the future.

  @tjun — Mercari

This article gives 6 different ways to organize the findings from your retrospective to share with different audiences.

  Vanessa Huerta Granda — Jeli

There’s a great reliability story in the way that the Hubble telescope and the Apollo missions used gimbals — and in the way that the JWST doesn’t.

  Robert Barron — IBM

Outages

  • Hulu
  • IRS
    • The US Internal Revenue Service’s systems went down on the due date for tax filing.

  • Instagram

SRE Weekly Issue #318

Articles

This talk summary explores the idea that “error” is a concept applied to an event from the outside, rather than a simple fact. What can this tell us about our after-incident investigation process?

  Fred Hebert

Here’s a deep dive into a performance degradation in Cloudflare last December that was related to missing error handling in a shell script.

  Alex Forster — Cloudflare

Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.

Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).

  Gergely Orosz — Pragmatic Engineer

Cool trick: their client library can fall back to a backup domain if DNS for ably.io fails.

  Jo Stichbury — Ably
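
For a feel of how that kind of fallback can work, here is a minimal sketch. It is not Ably's actual client code: the function name and the example hostnames are hypothetical, and a real client would likely also fall back on connection errors and timeouts, not only DNS failures.

```python
import socket
from typing import Callable, Optional, Sequence


def first_resolvable_host(
    hosts: Sequence[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Optional[str]:
    """Return the first host whose name resolves, trying hosts in order.

    `hosts` is a hypothetical ordered list: primary domain first, then
    backups on a different domain (ideally a different TLD and DNS provider).
    """
    for host in hosts:
        try:
            resolve(host)
            return host
        except OSError:
            # socket.gaierror (DNS lookup failure) is a subclass of OSError,
            # so a failed lookup lands here and we move on to the next host.
            continue
    return None


if __name__ == "__main__":
    # Placeholder hostnames for illustration only.
    candidates = ["primary.example.com", "fallback.example.net"]
    print(first_resolvable_host(candidates))
```

Passing the resolver in as a parameter keeps the fallback logic testable without touching the network.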

It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.

  Laurent Bernaille and David Lentz — DataDog

Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.

  Zia Mian, M. V. Ramana — Scientific American

Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

SRE Weekly Issue #317

Bit of a short issue this week, as I’m currently recovering from COVID-19. Please don’t worry! I seem to have a very minor case, likely thanks in large part to vaccination and masking. I mostly just feel tired.

Articles

This first article about the RaDonda Vaught case gives background and an overview of why prosecuting a nurse for a medication error is a bad idea.

Sending a nurse to prison for causing a patient’s death may satisfy the thirst for vengeance, but it won’t make hospitals any safer.

  Jessie Singer — The Nation

And this one goes into more detail about Vaught’s case and medical error in general, from the perspective of a doctor.

  Rob Poston

GitHub shares more detail about their very rough March.

  Jakub Oleksy — GitHub

I formerly advocated that the point of a retrospective was to produce action items. Now, my opinion is more nuanced and along the lines of this article. Action items are important, but we can’t let them get in the way of learning.

  Emily Ruppe — Jeli

I’ve done this before without even meaning to, and looking back on it, it was a great strategy.

When you know that your work will be reviewed by an expert, it’s better to be clear and wrong than vague.

  Lorin Hochstein

Outages

  • Atlassian Cloud
    • This affects Jira, Confluence, Statuspage.io, and OpsGenie. The incident has been ongoing for 5 days and counting.

  • Starlink

SRE Weekly Issue #316

I’m on vacation, so I prepared this issue in advance. Practically speaking, that just means there’s no Outages section this week. See you all next week!

P.S. Okay, I know I said no outages, but I will say that I’m keeping an eye on the Southwest Airlines outage, because we’re kind of counting on them to get home in a few days…

Articles

Yes.

  Chris Evans — incident.io

If you don’t test them, you don’t have backups; you have a lottery ticket. Except the chance of winning is high. And the prize is data loss.

  Emily Arnott — Blameless

Being blameless does not mean blaming no one outwardly and blaming yourself inside your head.

  Emily Arnott — Blameless

LinkedIn’s Alert Correlation system posts recommendations to Slack about which microservice may be at the heart of an incident.

  Nishant Singh — LinkedIn

I always get the two confused. This article explains the difference and gives tips for writing runbooks. More on runbooks from the same folks here.

  Jessica Abelson — Transposit

There are many intricate details in there! For example, the S3 SLA is per calendar month, not a rolling window, so the SLA of your product based on it might need to match.

  Alex Ewerlöf
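
To see why the window type matters, here is a small worked sketch. The outage log and the 99.9% target are made-up illustrative values, not figures from the article or from any provider's SLA. Two 30-minute outages on either side of a month boundary leave both calendar months above the target, while a rolling 30-day window that contains both falls below it.

```python
import calendar
from datetime import date, timedelta

# Hypothetical outage log: (day, minutes of downtime). Not real data.
OUTAGES = [(date(2022, 1, 31), 30), (date(2022, 2, 1), 30)]
TARGET = 0.999  # illustrative availability target (an assumption)


def calendar_month_availability(year, month, outages):
    """Availability over one calendar month, as a fraction of total minutes."""
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    down = sum(m for d, m in outages if d.year == year and d.month == month)
    return 1 - down / total_minutes


def rolling_availability(end, window_days, outages):
    """Availability over the `window_days` days ending on `end` (inclusive)."""
    start = end - timedelta(days=window_days - 1)
    total_minutes = window_days * 24 * 60
    down = sum(m for d, m in outages if start <= d <= end)
    return 1 - down / total_minutes


if __name__ == "__main__":
    jan = calendar_month_availability(2022, 1, OUTAGES)
    feb = calendar_month_availability(2022, 2, OUTAGES)
    rolling = rolling_availability(date(2022, 2, 15), 30, OUTAGES)
    for label, value in [("January", jan), ("February", feb), ("Rolling 30d", rolling)]:
        status = "meets" if value >= TARGET else "misses"
        print(f"{label}: {value:.4%} ({status} target)")
```

If your own product promises a rolling-window SLA on top of a dependency measured per calendar month, a case like this one can breach your SLA without the dependency breaching theirs.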

The ins and outs of conducting an effective postmortem. Ready templates and examples from leading organizations around the world!

  Prathamesh Sonpatki — Last9

A production of Tinker Tinker Tinker, LLC