Reliability is something you do, not something you buy.
When discussing SRE, I love to pose the question, “What does it mean to engineer reliability?” That’s what this article is all about.
Russ Miles — ChaosIQ
Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work as done vs. work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.
Amy Tobey, with guests Craig Sebenik, David Blank-Edelman, and Kurt Andersen — Blameless
Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.
It’s time for another issue already! This one contains a really great essay by Jamie Woo entitled “What Does Fairness Mean for On-call Rotations?”, about how not all on-call shifts are equal.
Jamie Woo and Emil Stolarsky — Incident Labs
If your frontend has a hard dependency on multiple microservices, their failure rates are compounded. This article fills in the math behind the paper The Tail at Scale and shows that your backends’ SLOs may have to be significantly tighter than the frontend’s.
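To make the compounding concrete, here is a minimal sketch of the arithmetic. It is not taken from the article. It assumes every backend is a hard dependency and that backends fail independently of one another, which real systems only approximate. The function names and the 99.9% figures are illustrative.

```python
import math


def frontend_availability(backend_availabilities):
    """Best-case availability of a frontend that needs every backend to succeed.

    Assumes backend failures are independent, so the probabilities multiply.
    """
    return math.prod(backend_availabilities)


def required_backend_availability(frontend_target, n_backends):
    """Per-backend availability needed for n identical, independent hard
    dependencies to meet the frontend target (the n-th root of the target)."""
    return frontend_target ** (1.0 / n_backends)


if __name__ == "__main__":
    # Ten hard dependencies, each meeting a 99.9% SLO.
    combined = frontend_availability([0.999] * 10)
    print(f"frontend availability: {combined:.4%}")

    # What each backend would need for the frontend itself to reach 99.9%.
    needed = required_backend_availability(0.999, 10)
    print(f"required per-backend availability: {needed:.4%}")
```

Under these assumptions, ten backends at 99.9% each leave the frontend at roughly 99.0%, and holding the frontend to 99.9% would need each backend at about 99.99%, an error budget roughly ten times smaller.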
This post-incident analysis details how a hard dependency that needn’t have been hard took down the Heroku API, and how a fall-back didn’t work as intended.
I love Julia Evans’s ability to teach me something new that I didn’t realize I didn’t know.