SRE Weekly Issue #502

Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.

  Harris Hancock — Cloudflare
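
For readers new to consistent hashing, here is a minimal sketch of the general technique. It is not Cloudflare's implementation: the `HashRing` class, the `vnodes` parameter, and the server and script names are all hypothetical, chosen only for illustration. The idea is that each key maps to the first node point clockwise from its hash on a ring, so removing a node only remaps the keys that node owned.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Take the first 8 bytes of SHA-256 as a 64-bit ring position.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical servers and keys, assumed for the example.
servers = ["server-a", "server-b", "server-c"]
keys = [f"worker-script-{n}" for n in range(1000)]

before = HashRing(servers)
after = HashRing(["server-a", "server-b"])  # server-c removed

# Count keys that changed owner even though their owner was not removed.
moved = sum(
    1
    for k in keys
    if before.lookup(k) != "server-c" and before.lookup(k) != after.lookup(k)
)
print(moved)
```

Keys that were not on the removed server keep their owner, which is the property that makes this approach attractive for keeping warm state on the same machine.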

I appreciate the way this article also shares the downsides of logs, metrics, traces, and alerts, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.

  Mahmoud Abdelwahab — Railway

I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.

  John Allspaw — Adaptive Capacity Labs

IaC may create more problems than it solves, and it may simply move or hide complexity, according to this article.

  RoseSecurity

[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.

  Fred Hebert — summary

  Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper

This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.

  Marc Brooker

A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.

  Ar Hakboian — OpsWorker.ai

I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.

  Lorin Hochstein

Updated: December 21, 2025 — 10:13 pm
A production of Tinker Tinker Tinker, LLC