SRE Weekly Issue #217

Articles

Pre-requisites to Practicing Reliability?

Reliability is something you do, not something you buy.

When discussing SRE, I love to pose the question, “What does it mean to engineer reliability?”. That’s what this article is all about.

Russ Miles — ChaosIQ

Thought Leadership Panel: What is a ‘Real’ SRE?

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how can SREs approach work as done vs work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.

Amy Tobey, with guests Craig Sebenik, David Blank-Edelman, and Kurt Andersen — Blameless

The inevitable double bind

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

Lorin Hochstein

The Post-Incident Review Issue #3

It’s time for another issue already! This one contains a really great essay by Jamie Woo entitled “What Does Fairness Mean for On-call Rotations?”, about how not all on-call shifts are equal.

Jamie Woo and Emil Stolarsky — Incident Labs

The Tail at Scale

If your frontend has a hard dependency on multiple microservices, their failure rates are compounded. This article fills in the math behind the paper The Tail at Scale and shows that your backends’ SLOs may have to be significantly tighter than the frontend’s.

Bill Duncan

Heroku Incident #2021 Follow-up

This post-incident analysis details a case of a hard dependency that needn’t be hard, taking down the Heroku API, along with a fall-back that didn’t work as intended.

Why strace doesn’t work in Docker

I love Julia Evans’s ability to teach me something new that I didn’t realize I didn’t know.

Julia Evans

Outages

GitHub
Let’s Encrypt
- But you’ve automated your renewals, so this totally doesn’t matter, right?
Hulu
Uber Eats
Reddit
Discord

SRE Weekly Issue #217

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues