There’s a lot you can get out of this one even if you don’t happen to be using one of the Helm charts they evaluated. Their evaluation criteria are useful and easy to apply to other charts, and they also make a great study guide for those new to Kubernetes.
Prequel
This is the best explanation I’ve seen yet of exactly why SSL certificates are so difficult to get right in production.
Lorin Hochstein
An article on the importance of incident simulation for training, drawing from external experience in using simulations.
Stuart Rimell — Uptime Labs
I especially like the discussion of checklists, since they are often touted as a solution to the attention problem.
Chris Siebenmann
This is a new product/feature announcement, but it also has a ton of detail on their implementation, and it’s really neat to see how they built cloud provider region failure tolerance into WarpStream.
Dani Torramilans — WarpStream
It’s interesting to think of money spent on improving reliability as offsetting the cost of responding to incidents. It’s not one-to-one, but there’s an argument to be made here.
Florian Hoeppner
An explanation of the Nemawashi principle for driving buy-in for your initiatives. It is not aimed specifically at SREs, but we so often find ourselves seeking buy-in for our reliability initiatives.
Matt Hodgkins
The next time you’re flooded with alerts, ask yourself: Does this metric reflect customer pain, or is it just noise? The answer could change how you approach reliability forever.
Spiros Economakis
