SRE Weekly Issue #293

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging the right team, postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.io/?utm_source=sreweekly

Articles

It’s one thing to say you accept call-outs of unsafe situations — it’s another to actually do it. This cardiac surgeon shares what it’s like when high reliability organizations get it wrong.

Robert Poston, MD

The game has been a victim of its own success, and the developers have had to put in quite a lot of work to deal with the load.

PezRadar — Blizzard

This includes some lesser-known roles like Social Media Lead, Legal/Compliance Lead, and Partner Lead.

JJ Tang — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

There are a couple of great sections in this article, including “blameless” retrospectives that aren’t actually blameless, and being judicious in which remediation actions you take.

Chris Evans — incident.io

I love the idea that chaos monkey could actually be propping your infrastructure up. Oops.

Lorin Hochstein

I have to say, I’m really liking this DNS series.

Jan Schaumann

What? Why the heck am I including this here?

First, let’s all keep in mind that this situation is still very much unfolding, and not much is concretely known about what happened. It’s also emotionally fraught, especially for the victims and their families, and my heart goes out to them.

The thing that caught my eye about this article is that this looks like a classic complex system failure. There’s so much at play that led to this horrible accident, as outlined in this article and others, like this one (Julia Conley, Salon).

Aya Elamroussi, Chloe Melas and Claudia Dominguez — CNN

Outages

Updated: October 24, 2021 — 9:07 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme