Articles
I chatted with Emily Arnott of Blameless for a solid hour about everything from the origins of this newsletter and how I make it, to my thoughts on SRE and where it’s going. Somehow she managed to fit it all into this article. Thanks, Emily!
Emily Arnott — Blameless
The section on TTR (Time To Recovery) really caught my eye, both for confirming that MTTR is generally not a useful metric and for finding one case where TTR does seem to be predictive.
The Spotify engineering blog seems to be down as of this publishing, so here’s the archive.org version.
Clint Byrum — Spotify
SRE concepts apply wonderfully well to compliance and governance. Each field has a lot to learn from the other.
Jennifer Riggins — The New Stack
More than ever, we should all be focused on shipping great products, retaining high-demand engineers, and building trust with customers. And investing in a thoughtful incident management strategy is one way to get there. Let’s explore how.
Robert Ross — FireHydrant
At this week’s DevOps Enterprise Summit (DOES) Europe, Vanguard talked about how they moved from a traditional architecture to running mostly in the cloud, adopted site reliability engineering, and even built their own customer-facing SaaS.
Jennifer Riggins — The New Stack
This article has a great discussion of the risks of larger, less frequent deploys. It goes on to explain how Monzo transitioned to smaller, more frequent deploys while focusing on safety.
Will Sewell — Monzo
What makes this article special is its focus on addressing the common concerns that people have when you try to get them to own their code for its full lifecycle. It offers practical advice to win folks over.
Martha Lambert — incident.io
Sounds like there were some pretty great talks at SREcon. I gotta admit, I’m kinda having some FOMO.
Emily Arnott — Blameless