SRE Weekly Issue #365

They take us from the requirements analysis all the way through implementation of a high-throughput data store based on CockroachDB.

  Chuanpin Zhu and Debalin Das — DoorDash

On March 14th, Reddit engineers upgraded a Kubernetes cluster from 1.23 to 1.24, and all hell broke loose. I admire their precision in being down for 100π minutes.

  Jayme Howard — Reddit

With a huge user-base of students and teachers, these folks upped their incident response game, and they share how.

  Nadinastiti and Estu Fardani — GovTech Edu

A lurking bug in redis-py allowed users to see one another’s data, and OpenAI took ChatGPT down to limit the damage.


In Linux, source port allocation can be complex. This article shows why with a ton of code and tracing examples.

  Jakub Sitnicki — Cloudflare
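
As a taste of the topic, here is a minimal sketch of the simplest case only: asking the kernel to choose a source port by binding to port 0. It is not taken from the linked article, and the helper name `ephemeral_port` is made up for illustration. On Linux, the range the kernel draws from is set by the `net.ipv4.ip_local_port_range` sysctl.

```python
import socket


def ephemeral_port() -> int:
    """Bind to port 0 so the kernel picks a free port, then report which one.

    Hypothetical helper for illustration; not from the linked article.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 means "any free port"; the kernel chooses one for us.
        s.bind(("127.0.0.1", 0))
        # getsockname() returns (address, port) for an IPv4 socket.
        return s.getsockname()[1]


port = ephemeral_port()
print(f"kernel chose source port {port}")
```

The harder cases, such as sharing a port between connections to different destinations, are where the article's tracing examples come in.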

The gap between “paying for peak” and “earning on average” is critical to understand how the economics of large-scale cloud systems differ from traditional single-tenant systems.

  Marc Brooker
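
To make the peak-versus-average gap concrete, here is a small sketch using invented hourly load numbers; none of the figures come from the linked post. It assumes capacity cost tracks peak load while revenue tracks average load, and it compares one workload alone against three workloads whose peaks do not coincide and which share the same capacity.

```python
from statistics import mean

# Hypothetical hourly load for three tenants; the peaks fall in different hours.
tenant_a = [10, 2, 2, 2]
tenant_b = [2, 10, 2, 2]
tenant_c = [2, 2, 10, 2]


def average_to_peak(load: list[int]) -> float:
    """Fraction of provisioned (peak) capacity that is used on average."""
    return mean(load) / max(load)


# One tenant alone: capacity must cover its own peak.
alone = average_to_peak(tenant_a)

# Pooled: capacity must only cover the peak of the combined load.
combined = [a + b + c for a, b, c in zip(tenant_a, tenant_b, tenant_c)]
pooled = average_to_peak(combined)

sum_of_peaks = max(tenant_a) + max(tenant_b) + max(tenant_c)
peak_of_sum = max(combined)

print(f"alone:  {alone:.2f} of capacity used on average")
print(f"pooled: {pooled:.2f} of capacity used on average")
print(f"capacity needed: {sum_of_peaks} separately vs {peak_of_sum} pooled")
```

With these made-up numbers, pooling cuts the capacity needed by more than half, because the three peaks never land in the same hour.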

A configuration error was masked because the app automatically fell back to the original configuration. The problem only surfaced when the service was redeployed.
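
The failure mode can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the incident: the config format, `parse_config`, and both loader functions are hypothetical. The silent loader keeps the previous configuration when the new one fails to parse, which hides the error until a fresh deploy has no previous configuration to fall back to.

```python
import logging


def parse_config(raw: str) -> dict[str, int]:
    """Parse a hypothetical 'key=value' config line; raises ValueError on a bad value."""
    key, _, value = raw.partition("=")
    return {key.strip(): int(value)}


def reload_silently(raw: str, previous: dict[str, int]) -> dict[str, int]:
    """Fall back to the previous config on error; the mistake goes unnoticed."""
    try:
        return parse_config(raw)
    except ValueError:
        return previous


def reload_loudly(raw: str, previous: dict[str, int]) -> dict[str, int]:
    """Same fallback, but log the problem so someone can fix the config."""
    try:
        return parse_config(raw)
    except ValueError:
        logging.error("invalid config %r; keeping previous config", raw)
        return previous


running = {"timeout_seconds": 30}
bad = "timeout_seconds=thirty"

# The running service looks healthy because it quietly keeps the old value.
still_running = reload_silently(bad, running)

# A freshly deployed instance has nothing to fall back to, so the error surfaces.
try:
    parse_config(bad)
    fresh_start_ok = True
except ValueError:
    fresh_start_ok = False

print(still_running, fresh_start_ok)
```

Logging or alerting on the fallback path, as in `reload_loudly`, is one way to surface the error while the service is still running rather than at the next deploy.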


