SRE Weekly Issue #316

I’m on vacation, so I prepared this issue in advance. Practically speaking, that just means there’s no Outages section this week. See you all next week!

P.S. Okay, I know I said no outages, but I will say that I’m keeping an eye on the Southwest Airlines outage, because we’re kind of counting on them to get home in a few days…

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channels, Jira tickets, and Zoom meetings, paging and adding responders, building the postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly LEGO set):
https://rootly.com/demo/

Articles

Yes.

  Chris Evans — incident.io

If you don’t test them, you don’t have backups; you have a lottery ticket. Except the chance of winning is high. And the prize is data loss.

  Emily Arnott — Blameless

Being blameless does not mean blaming no one outwardly and blaming yourself inside your head.

  Emily Arnott — Blameless

LinkedIn’s Alert Correlation system posts recommendations to Slack about which microservice may be at the heart of an incident.

  Nishant Singh — LinkedIn

I always get the two confused. This article explains the difference and gives tips for writing runbooks. More on runbooks from the same folks here.

  Jessica Abelson — Transposit

There are many intricate details in there! For example, the S3 SLA is per calendar month, not a rolling window, so the SLA of your product based on it might need to match.

  Alex Ewerlöf
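
To make the calendar-month point concrete, here is a rough sketch of how the downtime budget shifts with month length, and how one outage that straddles a month boundary can pass a per-calendar-month check while failing a rolling 30-day one. The 99.9% target and the function names are assumptions for illustration only; check the actual SLA text for the real thresholds.

```python
import calendar

# Assumed availability target, for illustration only (not taken from the SLA text).
TARGET = 0.999


def calendar_month_budget_minutes(year, month, target=TARGET):
    """Minutes of downtime allowed in one calendar month at the given target."""
    days = calendar.monthrange(year, month)[1]  # number of days in that month
    return days * 24 * 60 * (1 - target)


def rolling_budget_minutes(window_days=30, target=TARGET):
    """Minutes of downtime allowed in a fixed-length rolling window."""
    return window_days * 24 * 60 * (1 - target)


if __name__ == "__main__":
    # The budget depends on how long the month is.
    for month in (2, 3):
        budget = calendar_month_budget_minutes(2022, month)
        print(f"2022-{month:02d}: {budget:.2f} minutes")
    print(f"rolling 30 days: {rolling_budget_minutes():.2f} minutes")

    # Hypothetical outage: 60 minutes spanning midnight between Feb 28 and Mar 1,
    # so 30 minutes land in each calendar month.
    feb_downtime, mar_downtime = 30, 30
    within_calendar = (
        feb_downtime <= calendar_month_budget_minutes(2022, 2)
        and mar_downtime <= calendar_month_budget_minutes(2022, 3)
    )
    within_rolling = (feb_downtime + mar_downtime) <= rolling_budget_minutes()
    print("within calendar-month budgets:", within_calendar)
    print("within rolling 30-day budget:", within_rolling)
```

If your own product promises a rolling window on top of a per-calendar-month dependency, the two checks can disagree in exactly this way.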

The ins and outs of conducting an effective postmortem, with ready-made templates and examples from leading organizations around the world!

  Prathamesh Sonpatki — Last9

Updated: April 3, 2022 — 9:03 pm
A production of Tinker Tinker Tinker, LLC