Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.
Harris Hancock — Cloudflare
I appreciate the way this article also shares how each of logs, metrics, traces, and alerts has its downsides, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.
Mahmoud Abdelwahab — Railway
I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.
John Allspaw — Adaptive Capacity Labs
IaC may cause more problems than it solves, and it may simply move or hide complexity, according to this article.
RoseSecurity
[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.
Fred Hebert — summary
Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper
This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.
Marc Brooker
A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.
Ar Hakboian — OpsWorker.ai
I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.
Lorin Hochstein
