SRE Weekly Issue #454

Nine entire years ago, I threw together a few “issues” with my favorite SRE articles, installed WordPress, and added a subscription form, with no clue what I was doing. It’s only thanks to you folks, the thousands of subscribers and the many authors of great SRE content, that I’ve been able to keep this up for so long. Thank you, you make it fun! And as always, thanks also to my sponsors, former, current, and future, who’ve helped make this whole thing possible.

TTR: the out-of-control metric

When we try to optimize MTTR as if it’s a meaningful statistic, we run into trouble. This article does a great job of explaining why, drawing from concepts and techniques in manufacturing.

Lorin Hochstein

The Case for Shared Storage

This article introduces the concepts of “shared nothing” and “shared storage” in distributed systems and then explains why they chose shared storage for WarpStream.

Richard Artoul — WarpStream

Do Dollars Make Sense for Incident Management?

How much did that incident cost in lost revenue? This article says you should avoid including that number in your incident management process, because it’s a trap.

Tom Webster — Rootly

Breaking down CPU speed: How utilization impacts performance

Pushing a system to 100% CPU utilization can slow workloads down. This article is about experimentally finding the sweet spot between utilizing CPUs as much as possible and avoiding performance issues.

Andreas Strikos — GitHub

Solutions to the Lost Update Problem

This article has a couple of strategies for handling concurrent updates to the same row in MySQL, with and without locking.

Sönke Ruempler
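
For readers who haven't met the problem before, here is a minimal sketch of one common lock-free approach, a version column checked at write time. It is not taken from the article: the table, column, and function names are made up for illustration, and it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one row with a balance and a version counter.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")
    conn.commit()
    return conn


def read_account(conn, account_id):
    # Returns (balance, version) as seen at read time.
    return conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()


def write_if_unchanged(conn, account_id, new_balance, seen_version):
    # The WHERE clause only matches if nobody bumped the version since we read it.
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1  # False means our read was stale


def add_with_retry(conn, account_id, amount, max_attempts=5):
    # Re-read and retry when another writer got there first.
    for _ in range(max_attempts):
        balance, version = read_account(conn, account_id)
        if write_if_unchanged(conn, account_id, balance + amount, version):
            return True
    return False
```

A writer holding a stale version gets `False` back instead of silently overwriting the other update, and can then re-read and try again.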

How we page ourselves if incident.io goes down

They do it with a dead man’s switch, implemented using a backup alert provider.

Lawrence Jones — incident.io
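
If the term is new to you, the general idea of a dead man's switch can be sketched as below. This is only an illustration of the pattern, not incident.io's implementation: the class name, method names, and timeout are all hypothetical, and a real setup would have the check run at an independent provider rather than in the same process.

```python
import time


class DeadMansSwitch:
    """Fires when heartbeats stop arriving, rather than when an error is reported."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.last_heartbeat = clock()

    def heartbeat(self):
        # Called periodically by the primary system while it is healthy.
        self.last_heartbeat = self.clock()

    def should_page(self):
        # Evaluated by the backup side: silence for too long means trouble.
        return self.clock() - self.last_heartbeat > self.timeout_seconds
```

The point of the design is that the primary system has to keep proving it is alive, so the backup still pages someone even when the primary is too broken to send any alert itself.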

The long way towards resilience – Part 6

I came across part 6 first and I need to go back and read the rest, but I just had to share this now, because of the cool concept it contains: that efficiency and resiliency are at odds with each other.

Uwe Friedrichsen

Keeping User Journey SLOs Up-to-Date with E2E Testing in a Microservices Architecture

This is so cool! Their system automatically figures out which API calls are critical to each user journey and keeps the list updated.

yakenji — Mercari
