SRE Weekly Issue #502

Eliminating Cold Starts 2: shard and conquer

Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.

Harris Hancock — Cloudflare

Monitoring & Observability: Using Logs, Metrics, Traces, and Alerts to Understand System Failures

I appreciate the way this article also shares how each of logs, metrics, traces, and alerts has its downsides, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.

Mahmoud Abdelwahab — Railway

Uptime Labs: Building Expertise in Incident Response

I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.

John Allspaw — Adaptive Capacity Labs

KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever

IaC may cause more trouble than it solves, and it may simply move or hide complexity, according to this article.

RoseSecurity

Paper: The Failure Gap

[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.

Fred Hebert — summary

Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper

What Now? Handling Errors in Large Systems

This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.

Marc Brooker

Agent-Driven SRE Investigations: A Practical Deep Dive into Multi-Agent Incident Response

A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.

Ar Hakboian — OpsWorker.ai

Brief thoughts on the recent Cloudflare outage

I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.

Lorin Hochstein
