Here’s the next incredibly useful article in Jeli’s Incident Analysis 101 series. This one covers the skills and traits of a good incident analyst, along with what not to look for.
Laura Maguire — Jeli
This article goes into remarkable detail on 13 cache-related incidents at Twitter. The authors open with an explanation of why they focused on cache-related incidents.
Dan Luu and Yao Yue
[…] the same three pillars form the core of any good process, whether it’s for the largest e-commerce giant or a scrappy SaaS startup.
The three pillars are:
Lisa Karlin Curtis — incident.io
This one recommends doing away with “P0” and “P5” and instead using plain words like “Low” and “High”.
Stephen Whitworth — incident.io
Feature flags can be a useful way to resolve user impact during an incident.
Weihan Li — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
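To make the feature-flag idea concrete, here is a minimal sketch of a flag used as a kill switch during an incident. It is not taken from the Rootly article: the `FeatureFlags` class, the `recommendations` flag name, and `render_homepage` are all hypothetical, and a real system would read flags from a shared flag service rather than an in-memory dict.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store; a real one would be backed by a flag service."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        # Unknown flags default to off, so a missing flag never enables a feature.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = enabled


def render_homepage(flags):
    """Build the list of page sections, skipping any feature whose flag is off."""
    sections = ["header", "catalog"]
    if flags.is_enabled("recommendations"):  # hypothetical flag name
        sections.append("recommendations")
    return sections


flags = FeatureFlags({"recommendations": True})
print(render_homepage(flags))

# During an incident, a responder turns the misbehaving feature off without a deploy.
flags.set("recommendations", False)
print(render_homepage(flags))
```

The point of the sketch is that the degraded path (a page without recommendations) already exists in the code, so flipping the flag removes user impact while the underlying fault is investigated.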
Implementing a dead-switch for your alerting tool is really important so that you don’t blissfully sleep through an outage.
Chris Loukas — HelloFresh
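As a rough illustration of the dead-switch idea, the sketch below shows one common shape: the alerting tool sends a regular heartbeat to an independent watcher, and the watcher raises the alarm when the heartbeats stop. This is an assumption about the general pattern, not the HelloFresh implementation; the `DeadMansSwitch` class and its 60-second timeout are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DeadMansSwitch:
    """Hypothetical watcher that trips when heartbeats stop arriving."""

    timeout_seconds: float
    last_heartbeat: float = field(default_factory=time.monotonic)

    def heartbeat(self, now=None):
        # Called whenever the alerting tool checks in.
        self.last_heartbeat = time.monotonic() if now is None else now

    def is_tripped(self, now=None):
        # Tripped means the alerting tool has been silent for longer than the timeout.
        current = time.monotonic() if now is None else now
        return (current - self.last_heartbeat) > self.timeout_seconds


# Explicit timestamps are passed here only to keep the example deterministic.
switch = DeadMansSwitch(timeout_seconds=60, last_heartbeat=0.0)
print(switch.is_tripped(now=30.0))   # heartbeat is recent enough
print(switch.is_tripped(now=61.0))   # silence exceeded the timeout
```

For the switch to be useful, the watcher has to run somewhere that does not share a failure mode with the alerting tool it is watching, and it needs its own route for paging someone.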
As SRE #1, the author of this article got to define the SRE role from the ground up.
Fred Hebert — Honeycomb
In this article, I will share five lessons I learned about starting SRE teams (or engagements, or organizations).
This article is all about the shape of an SRE team, rather than technical details like SLOs and such.
Andrea Spadaccini — USENIX ;login: