I’m on vacation, so I prepared this issue in advance. Practically speaking, that just means there’s no Outages section this week. See you all next week!
P.S. Okay, I know I said no outages, but I will say that I’m keeping an eye on the Southwest Airlines outage, because we’re kind of counting on them to get home in a few days…
Articles
Yes.
Chris Evans — incident.io
If you don’t test them, you don’t have backups; you have a lottery ticket. Except the chance of winning is high. And the prize is data loss.
Emily Arnott — Blameless
Being blameless does not mean blaming no one outwardly and blaming yourself inside your head.
Emily Arnott — Blameless
LinkedIn’s Alert Correlation system posts recommendations to Slack about which microservice may be at the heart of an incident.
Nishant Singh — LinkedIn
I always get the two confused. This article explains the difference and gives tips for writing runbooks. More on runbooks from the same folks here.
Jessica Abelson — Transposit
There are many intricate details in there! For example, the S3 SLA is per calendar month, not a rolling window, so the SLA of your product based on it might need to match.
Alex Ewerlöf
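To make the calendar-month point concrete, here is a minimal sketch, not anything taken from the article. It shows how the same two outages score differently under a calendar-month measurement and a rolling 30-day one. The outage log, the function names, and the shortcut of attributing each outage entirely to its start time are all assumptions made for illustration.

```python
import calendar
from datetime import datetime, timedelta

# Hypothetical outage log: (start time, minutes of downtime).
OUTAGES = [
    (datetime(2023, 1, 30, 12, 0), 120),
    (datetime(2023, 2, 2, 12, 0), 120),
]


def availability(down_minutes, total_minutes):
    """Availability as a percentage of the measurement period."""
    return 100.0 * (1 - down_minutes / total_minutes)


def calendar_month_availability(outages, year, month):
    """Count only outages that started in the given calendar month."""
    days_in_month = calendar.monthrange(year, month)[1]
    down = sum(
        minutes
        for start, minutes in outages
        if (start.year, start.month) == (year, month)
    )
    return availability(down, days_in_month * 24 * 60)


def rolling_availability(outages, window_end, days=30):
    """Count outages that started in the `days` days before window_end."""
    window_start = window_end - timedelta(days=days)
    down = sum(
        minutes
        for start, minutes in outages
        if window_start <= start < window_end
    )
    return availability(down, days * 24 * 60)


if __name__ == "__main__":
    print(f"January:  {calendar_month_availability(OUTAGES, 2023, 1):.3f}%")
    print(f"February: {calendar_month_availability(OUTAGES, 2023, 2):.3f}%")
    rolling = rolling_availability(OUTAGES, datetime(2023, 2, 15))
    print(f"Rolling 30 days to Feb 15: {rolling:.3f}%")
```

Each calendar month only sees one of the two outages, while the rolling window ending February 15 sees both, so a rolling-window promise built on a calendar-month dependency can be missed even when the dependency meets its own terms.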
The ins and outs of conducting an effective postmortem, with ready-to-use templates and examples from leading organizations around the world!
Prathamesh Sonpatki — Last9