Cloudflare had a major incident this week, and they say it was their worst since 2019. In this report, they explain what happened, and the failure mode is pretty interesting.
Matthew Prince — Cloudflare
How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the V1-to-V2 migration.
They cover not just the motivation for and improvements in V2, but also how they deployed V2 without interruption.
Shravan Gaonkar — Airbnb
Netflix’s WAL service acts as a go-between, streaming data to pluggable targets while providing extra functionality like retries, delayed sending, and a dead-letter queue.
Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu — Netflix
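That go-between pattern is broadly useful beyond Netflix. Here's a minimal sketch of the idea (not Netflix's actual code; the `Target` protocol, retry parameters, and dead-letter callback are all hypothetical), showing retries with backoff, delayed sending, and a dead-letter fallback:

```python
import time
from dataclasses import dataclass
from typing import Callable, Protocol


class Target(Protocol):
    """A pluggable downstream target, e.g. a database, cache, or queue."""
    def send(self, event: dict) -> None: ...


@dataclass
class WalForwarder:
    target: Target
    dead_letter: Callable[[dict], None]  # e.g. publish to a DLQ topic
    max_retries: int = 5
    base_delay_s: float = 0.5

    def forward(self, event: dict, delay_s: float = 0.0) -> None:
        """Deliver one event: optional delayed send, then retries with
        exponential backoff, then dead-letter on exhaustion."""
        if delay_s > 0:
            time.sleep(delay_s)  # delayed sending
        for attempt in range(self.max_retries):
            try:
                self.target.send(event)
                return
            except Exception:
                time.sleep(self.base_delay_s * 2 ** attempt)
        self.dead_letter(event)  # never drop: park the event for replay
```

The appeal of the design is that delivery never silently drops an event: it either reaches the target or lands in the dead-letter queue for later replay.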
A (very) deep dive into Datadog’s custom data store, with special attention to how it handles query planning and optimization.
Sami Tabet — Datadog
Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use them to learn as much as possible about the work involved in diagnosing and remediating incidents in your company.
Lorin Hochstein
We landed on a two-level failure capture design that combines Kafka topics with an S3 backup to ensure no event is ever lost.
Tanya Fesenko, Collin Crowell, Dmitry Mamyrin, and Chinmay Sawaji — Klaviyo
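Their wording maps onto a simple pattern: try to record the failed event in a Kafka failure topic, and if Kafka itself is unavailable, fall back to S3. A minimal sketch, assuming kafka-python and boto3 (the topic and bucket names are invented; Klaviyo's real implementation surely differs):

```python
import json
import uuid

import boto3
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
s3 = boto3.client("s3")

FAILURE_TOPIC = "event-failures"       # hypothetical names
BACKUP_BUCKET = "event-failure-backup"


def capture_failure(event: dict) -> None:
    """Level 1: write the failed event to a Kafka failure topic.
    Level 2: if Kafka is also down, write it to S3 instead, so the
    event is never lost."""
    try:
        producer.send(FAILURE_TOPIC, event).get(timeout=5)
    except Exception:
        s3.put_object(
            Bucket=BACKUP_BUCKET,
            Key=f"failures/{uuid.uuid4()}.json",
            Body=json.dumps(event).encode(),
        )
```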
Buried in this one is a gem: as a last layer of reliability, their client library automatically retries against alternate regions if the primary region fails.
Paddy Byers — Ably
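To make that gem concrete, a client-side region fallback loop can be quite simple. A rough sketch (hypothetical hostnames and a plain HTTP GET, not Ably's actual client code):

```python
import requests

# Hypothetical endpoints; a real client would ship a vetted fallback list.
REGION_HOSTS = [
    "main.example.com",       # primary region
    "us-east-1.example.com",  # alternates
    "eu-west-1.example.com",
]


def get_with_region_fallback(path: str, timeout_s: float = 3.0) -> requests.Response:
    """Try the primary region first; on failure, retry against the
    alternate regions so a regional outage doesn't fail the request."""
    last_error = None
    for host in REGION_HOSTS:
        try:
            resp = requests.get(f"https://{host}{path}", timeout=timeout_s)
            resp.raise_for_status()
            return resp
        except requests.RequestException as err:
            last_error = err  # move on to the next region
    raise last_error
```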
incident.io shares details on how they fared during the AWS us-east-1 incident on October 20.
Pete Hamilton — incident.io
