Nine entire years ago, I threw together a few “issues” with my favorite SRE articles, installed WordPress, and added a subscription form, with no clue what I was doing. It’s only thanks to you folks, the thousands of subscribers and the many authors of great SRE content, that I’ve been able to keep this up for so long. Thank you, you make it fun! And as always, thanks also to my sponsors, former, current, and future, who’ve helped make this whole thing possible.
When we try to optimize MTTR as if it’s a meaningful statistic, we run into trouble. This article does a great job of explaining why, drawing from concepts and techniques in manufacturing.
Lorin Hochstein
This article introduces the concepts of “shared nothing” and “shared storage” in distributed systems and then explains why they chose shared storage for WarpStream.
Richard Artoul — WarpStream
How much did that incident cost in lost revenue? This article says you should avoid including that number in your incident management process, because it’s a trap.
Tom Webster — Rootly
Pushing a system to 100% CPU utilization can slow workloads down. This article is about experimentally finding the sweet spot between utilizing CPUs as much as possible and avoiding performance issues.
Andreas Strikos — GitHub
This article has a couple of strategies for handling concurrent updates to the same row in MySQL, with and without locking.
Sönke Ruempler
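For readers who want a feel for the lock-free side before clicking through, here is a minimal sketch of optimistic concurrency using a version column. It is one common approach and not necessarily the one the article describes. The `accounts` table, its columns, and the `update_balance_optimistic` helper are hypothetical, and SQLite stands in for MySQL so the example runs without a server.

```python
import sqlite3


def update_balance_optimistic(conn, account_id, delta, max_retries=3):
    """Apply delta to one row without holding a lock across the read.

    Hypothetical helper: the write only succeeds if the row still has the
    version we read. If another writer got there first, re-read and retry.
    """
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT balance, version FROM accounts WHERE id = ?",
            (account_id,),
        ).fetchone()
        if row is None:
            raise KeyError(account_id)
        balance, version = row

        # The version check in the WHERE clause is what detects a conflict.
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance + delta, account_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # our write won
        # rowcount == 0: someone else bumped the version, so loop and retry
    return False


# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")
conn.commit()
```

The locking counterpart in MySQL would typically read the row with `SELECT ... FOR UPDATE` inside a transaction, which makes other writers wait instead of retry.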
They do it with a dead man’s switch, implemented using a backup alert provider.
Lawrence Jones — incident.io
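A dead man's switch inverts the usual alerting logic: the monitored system checks in on a schedule, and silence is what triggers the alarm. The sketch below shows only that general idea, not incident.io's implementation. The `DeadMansSwitch` class, its method names, and the 60-second timeout are all assumptions made for illustration.

```python
import time


class DeadMansSwitch:
    """Hypothetical sketch: trips when no check-in arrives within the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the switch can be tested without sleeping
        self._last_checkin = clock()

    def check_in(self):
        """Called by the monitored system on every heartbeat."""
        self._last_checkin = self._clock()

    def is_tripped(self):
        """Polled by the independent watcher; True means the heartbeat went quiet."""
        return self._clock() - self._last_checkin > self.timeout_seconds


# Usage with a fake clock (assumed 60-second timeout).
now = [0.0]
switch = DeadMansSwitch(60, clock=lambda: now[0])
```

The key design point is that whatever polls `is_tripped()` has to live outside the system being watched, which is why a backup provider is a natural fit.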
I came across part 6 first and I need to go back and read the rest, but I just had to share this now, because of the cool concept it contains: that efficiency and resiliency are at odds with each other.
Uwe Friedrichsen
This is so cool! Their system automatically figures out which API calls are critical to each user journey and keeps the list updated.
yakenji — Mercari