This article presents an incident theme that I’ve lived through many times but never had such a pithy name for.
Geoff Townsend — Blameless
There are risks and downsides inherent in a distributed system, so it’s worth thinking about whether you really need one.
Pipitz — Adevinta
And here’s a counterpoint to the previous article: deciding whether you need a distributed system isn’t just about scale.
Marc Brooker
The effectiveness of memes in availability campaigns.
This short post is a pile of memes, and the video one is top-notch.
Ross Brodbeck
Paraphrasing part of this article: either you didn’t understand your system fully when you wrote the alert, or there really are sporadic failures.
Chris Siebenmann
If you’ve ever created an action item from an incident along the lines of “don’t take unnecessary risks in the future”, you need to read this one.
The rest of you need to read it too.
Lorin Hochstein
A how-to for building anomaly detection alerting in Prometheus with specific config examples.
Karl Stoney
A panicked engineer asks reddit’s r/sre about an incident they caused: how could they have done better? Will they be fired? The comments are spot on, and this conversation is fresh enough that you could jump in too if you’re interested.
u/console_fulcrum and others — reddit
Last Monday, Honeycomb had an outage related to a schema migration involving MySQL’s ENUM data type, and they posted this incident report.
Bonus content: I wasn’t aware of ENUMs at all, so I had to brush up with this article: 8 Reasons Why MySQL’s ENUM Data Type Is Evil.
Honeycomb
Full disclosure: Honeycomb is my employer.
An experienced SRE discusses the skills and experiences you might be quizzed about in an interview for an SRE role.
Krishna Vinnakota — DZone