SRE Weekly Issue #318

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

This talk summary explores the idea that “error” is a label applied to an event from the outside, rather than a simple fact. What can this tell us about our after-incident investigation process?

  Fred Hebert

Here’s a deep dive into a performance degradation in Cloudflare last December that was related to missing error handling in a shell script.

  Alex Forster — Cloudflare

Atlassian is having a tough time. It seems as if they deleted a few hundred customers’ data and have to pull it out of their backups one at a time.

Here’s another article about the outage (Steven J. Vaughan-Nichols — The New Stack).

  Gergely Orosz — Pragmatic Engineer

Cool trick: their client library can fall back to a backup domain if DNS for ably.io fails.

  Jo Stichbury — Ably
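
To make the fallback idea concrete, here is a minimal sketch of a client picking the first hostname that resolves. This is not Ably's actual client code; the function name `first_resolvable`, the injectable `resolve` parameter, and the example hostnames are all assumptions made for illustration.

```python
import socket
from typing import Callable, Optional, Sequence


def first_resolvable(
    hosts: Sequence[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Optional[str]:
    """Return the first host that resolves, or None if none of them do.

    `hosts` is ordered by preference: primary domain first, then fallbacks
    that should live under a different domain (hypothetical names here).
    """
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue  # resolution failed, try the next candidate
        return host
    return None


if __name__ == "__main__":
    # Hypothetical hostnames, used only to show the call shape.
    candidates = ["realtime.example.com", "fallback-a.example.net"]
    print(first_resolvable(candidates))
```

The point of the design is that the fallback names sit under a different domain than the primary, so a DNS problem affecting one domain is less likely to take out both.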

It still wasn’t quite DNS; it was an interesting situation with the Linux kernel’s martian packet detection algorithm.

  Laurent Bernaille and David Lentz — Datadog

Aside from the terrifying risk of nuclear war, this sounds very similar to the kind of complex system failures SREs deal with routinely.

  Zia Mian, M. V. Ramana — Scientific American

Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

SRE Weekly Issue #317

Bit of a short issue this week, as I’m currently recovering from COVID-19. Please don’t worry! I seem to have a very minor case, likely thanks in large part to vaccination and masking. I mostly just feel tired.

Articles

This first article about the RaDonda Vaught case gives background and an overview of why prosecuting a nurse for a medication error is a bad idea.

Sending a nurse to prison for causing a patient’s death may satisfy the thirst for vengeance, but it won’t make hospitals any safer.

  Jessie Singer — The Nation

And this one goes into more detail about Vaught’s case and medical error in general, from the perspective of a doctor.

  Rob Poston

GitHub shares more detail about their very rough March.

  Jakub Oleksy — GitHub

I formerly advocated that the point of a retrospective was to produce action items. Now, my opinion is more nuanced and along the lines of this article. Action items are important, but we can’t let them get in the way of learning.

  Emily Ruppe — Jeli

I’ve done this before without even meaning to, and looking back on it, it was a great strategy.

When you know that your work will be reviewed by an expert, it’s better to be clear and wrong than vague.

  Lorin Hochstein

Outages

  • Atlassian Cloud
    • This affects Jira, Confluence, Statuspage.io, and OpsGenie. The incident has been ongoing for 5 days and counting.

  • Starlink

SRE Weekly Issue #316

I’m on vacation, so I prepared this issue in advance. Practically speaking, that just means there’s no Outages section this week. See you all next week!

P.S. Okay, I know I said no outages, but I will say that I’m keeping an eye on the Southwest Airlines outage, because we’re kind of counting on them to get home in a few days…

Articles

Yes.

  Chris Evans — incident.io

If you don’t test them, you don’t have backups; you have a lottery ticket. Except the chance of winning is high. And the prize is data loss.

  Emily Arnott — Blameless

Being blameless does not mean blaming no one outwardly and blaming yourself inside your head.

  Emily Arnott — Blameless

LinkedIn’s Alert Correlation system posts recommendations to Slack about which microservice may be at the heart of an incident.

  Nishant Singh — LinkedIn

I always get the two confused. This article explains the difference and gives tips for writing runbooks. More on runbooks from the same folks here.

  Jessica Abelson — Transposit

There are many intricate details in there! For example, the S3 SLA is per calendar month, not a rolling window, so the SLA of your product based on it might need to match.

  Alex Ewerlöf
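
A small sketch of why the measurement window matters. The outage log, the 99.9% target, and the helper names below are made up for illustration and are not taken from the article or from any provider's SLA text.

```python
import calendar
from datetime import date, timedelta

# Hypothetical outage log: day -> minutes of downtime (made-up numbers).
DOWNTIME = {date(2022, 1, 31): 40, date(2022, 2, 1): 40}

TARGET = 0.999  # assumed availability target, for illustration only


def availability(start: date, end: date, downtime=DOWNTIME) -> float:
    """Fraction of minutes up over the inclusive date range [start, end]."""
    total_minutes = ((end - start).days + 1) * 24 * 60
    down = sum(m for day, m in downtime.items() if start <= day <= end)
    return 1 - down / total_minutes


def calendar_month(year: int, month: int) -> float:
    """Availability measured over one calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    return availability(date(year, month, 1), date(year, month, last_day))


def rolling(end: date, days: int = 30) -> float:
    """Availability measured over the `days` days ending on `end`."""
    return availability(end - timedelta(days=days - 1), end)


if __name__ == "__main__":
    print(f"January:  {calendar_month(2022, 1):.5f}")
    print(f"February: {calendar_month(2022, 2):.5f}")
    print(f"Rolling 30 days to Feb 1: {rolling(date(2022, 2, 1)):.5f}")
```

With these numbers, two 40-minute outages on either side of a month boundary leave both calendar months above the assumed target, while the 30-day rolling window that contains both outages falls below it. A product promising a rolling-window SLA on top of a calendar-month dependency could therefore miss its own target without the dependency ever breaching.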

The ins and outs of conducting an effective postmortem. Ready templates and examples from leading organizations around the world!

  Prathamesh Sonpatki — Last9

SRE Weekly Issue #315

I’m going on vacation, so I’m going to prepare next week’s issue in advance. It’ll look much like most issues, except there won’t be an Outages section. See you all in two weeks!

Articles

In the previous articles in this series, they described a process of interviewing incident responders before a full retrospective meeting. This one discusses what to do if you can’t conduct those interviews, the particular challenges that brings, and how to deal with them.

  Emily Ruppe — Jeli

Some interesting ideas on potential downsides of circuit breakers and how we might ameliorate them.

  Marc Brooker

GitHub has had a bit of a hard time lately. Here’s an update on what they’re dealing with and how they’re planning to address it.

  Keith Ballinger — GitHub

All sorts of “mean time to” metrics, including 6(!) different MTTR metrics and how they might be used.

  Alex Ewerlöf — InfoQ

This is a huge 100+-page report on the benefits of a model in which development teams own the operation of their systems. There’s a lot in here, with carefully spelled-out pros/cons and cost/benefit analyses. Need to convince someone? Send them this.

We’ve written this playbook for CxOs, product managers, delivery managers, and operations managers.

  Bethan Timmins and Steve Smith — Equal Experts

It’s easy to miss MTUs, until they sneak up on you and cause really confusing problems.

  Aaron Kalair — Hudl

Should you compensate for on-call? How? I really want to see more articles about this, so send them my way if you see or write any.

  Chris Evans — incident.io

Some good tips in this article, and I love the case studies.

  Prathamesh Sonpatki — Last9

Outages

SRE Weekly Issue #314

Articles

The first episode of this new podcast answers the question in three ways: what Google says SRE is, what the podcast host thinks it is, and how people seem to be practicing SRE.

  Stephen Townsend — Slight Reliability

This aircraft accident report puts heavy emphasis on the deeper contributing factors rather than a seemingly obvious single root cause.

  Mentour Pilot

Google posted an incident report for the March 8 incident involving Traffic Director.

  Google

This one includes some neat graphs showing load and theoretical success rates for various strategies, such as no retries, N retries, token buckets, and circuit breakers.

  Marc Brooker
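
For readers who haven't met the token bucket strategy, here is a minimal sketch of one common form of it: a retry budget. It is not necessarily the exact model used in the article. The class name `RetryBudget`, the helper `call_with_budget`, and the default numbers are assumptions for illustration.

```python
class RetryBudget:
    """Token bucket that limits retries, not first attempts.

    Each retry spends one token; each success earns back a fraction of a
    token. When the bucket is empty, callers stop retrying.
    """

    def __init__(self, capacity: float = 10.0, refill_per_success: float = 0.1):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_success = refill_per_success

    def record_success(self) -> None:
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

    def try_acquire(self) -> bool:
        """Spend one token if available; return whether a retry is allowed."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(op, budget: RetryBudget, max_attempts: int = 3):
    """Call `op`, retrying on failure only while the budget allows it."""
    for attempt in range(max_attempts):
        try:
            result = op()
        except Exception:
            # Give up on the last attempt, or when no retry token is left.
            if attempt == max_attempts - 1 or not budget.try_acquire():
                raise
            continue
        budget.record_success()
        return result
```

The appeal over plain "N retries" is that when the downstream service is failing broadly, the bucket drains and extra load from retries drops to roughly the refill rate, while occasional isolated failures still get retried.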

What if your alerting system goes down? These folks set up a dead-switch to handle that situation.

  Miedwar Meshbesher — Nanit
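
The general shape of such a switch can be sketched in a few lines. This is not Nanit's implementation; the class name `DeadMansSwitch`, the timeout value, and the injectable clock are assumptions for illustration. The idea is that the alerting system sends a heartbeat on a schedule, and an independent watcher raises the alarm when the heartbeats stop.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Trips when no heartbeat has arrived within `timeout_s` seconds."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def heartbeat(self) -> None:
        """Called by the alerting system to say "I am still alive"."""
        self.last_beat = self.clock()

    def is_tripped(self) -> bool:
        """Called by the independent watcher; True means silence too long."""
        return self.clock() - self.last_beat > self.timeout_s


if __name__ == "__main__":
    switch = DeadMansSwitch(timeout_s=60)
    switch.heartbeat()
    print("tripped:", switch.is_tripped())
```

The watcher has to run somewhere that does not share failure modes with the alerting system itself, otherwise both can go quiet at the same time.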

Strategies for creating concise, efficient communication between teams during incidents and operational surprises.

[…] communications must be precise and descriptive to minimize confusion and accelerate a responder’s ability to assess and remedy the situation.

  Steve Stevens — Transposit

I really love these articles about hardware errors. They’re more common than we tend to realize.

  Harish Dattatraya Dixit — Facebook

Outages
