SRE Weekly Issue #454

Nine entire years ago, I threw together a few “issues” with my favorite SRE articles, installed WordPress, and added a subscription form, with no clue what I was doing. It’s only thanks to you folks, the thousands of subscribers and the many authors of great SRE content, that I’ve been able to keep this up for so long. Thank you, you make it fun! And as always, thanks also to my sponsors, former, current, and future, who’ve helped make this whole thing possible.

A message from our sponsor, FireHydrant:

Why migrate from PagerDuty? Empower team-level ownership, reduce costs, decouple alerts from incidents, automate incidents end-to-end…to name a few. Join the growing list of companies that have made the switch. (p.s. our Signals migrator makes it simple)

https://firehydrant.com/migrate/from-pagerduty/

When we try to optimize MTTR as if it’s a meaningful statistic, we run into trouble. This article does a great job of explaining why, drawing from concepts and techniques in manufacturing.

  Lorin Hochstein

This article introduces the concepts of “shared nothing” and “shared storage” in distributed systems and then explains why they chose shared storage for WarpStream.

  Richard Artoul — WarpStream

How much did that incident cost in lost revenue? This article says you should avoid including that number in your incident management process, because it’s a trap.

  Tom Webster — Rootly

Pushing a system to 100% CPU utilization can slow workloads down. This article is about experimentally finding the sweet spot between utilizing CPUs as much as possible and avoiding performance issues.

  Andreas Strikos — GitHub

This article has a couple of strategies for handling concurrent updates to the same row in MySQL, with and without locking.

  Sönke Ruempler
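
For a feel of what the "without locking" side can look like, here is a minimal sketch of one common approach, optimistic concurrency with a version column. It is not taken from the article. The `accounts` table, its columns, and the helper names are all made up for illustration, and it uses Python's built-in `sqlite3` as a stand-in so it runs anywhere. In MySQL the same `UPDATE ... WHERE version = ?` pattern applies, and the locking alternative would typically be a `SELECT ... FOR UPDATE` inside a transaction.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one row with a version counter for conflict detection.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")
    conn.commit()
    return conn


def read_account(conn, account_id):
    # Returns (balance, version) as last committed.
    return conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()


def try_update(conn, account_id, new_balance, expected_version):
    # The write only applies if nobody else bumped the version since we read it.
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1


def add_to_balance(conn, account_id, delta, max_retries=3):
    # Read, compute, attempt the conditional write, and retry on conflict.
    for _ in range(max_retries):
        balance, version = read_account(conn, account_id)
        if try_update(conn, account_id, balance + delta, version):
            return True
    return False
```

If two writers read the same version, the first `try_update` succeeds and the second affects zero rows, so the second writer re-reads and retries instead of silently overwriting the first.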

They do it with a dead man’s switch, implemented using a backup alert provider.

  Lawrence Jones — incident.io
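
If the term is new to you, a dead man's switch flips the usual alerting logic: the system being watched sends a regular heartbeat, and a separate watcher raises an alert when the heartbeats stop. Here is a minimal sketch of that general idea, not incident.io's actual implementation. The `DeadMansSwitch` class, its method names, and the timeout value are assumptions for illustration.

```python
import time


class DeadMansSwitch:
    """Reports an alert condition when heartbeats stop arriving (hypothetical sketch)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_heartbeat = clock()

    def heartbeat(self):
        # Called by the monitored system while it is healthy.
        self.last_heartbeat = self.clock()

    def should_alert(self):
        # True once more than timeout_seconds have passed without a heartbeat.
        return self.clock() - self.last_heartbeat > self.timeout_seconds
```

The watcher has to run somewhere that does not share a fate with the thing it watches, which is why a backup alert provider is a natural home for it.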

I came across part 6 first and I need to go back and read the rest, but I just had to share this now, because of the cool concept it contains: that efficiency and resiliency are at odds with each other.

  Uwe Friedrichsen

This is so cool! Their system automatically figures out which API calls are critical to each user journey and keeps the list updated.

  yakenji — Mercari

Updated: December 8, 2024 — 9:30 pm
A production of Tinker Tinker Tinker, LLC