SRE Weekly Issue #365

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?


They take us from the requirements analysis all the way through implementation of a high-throughput data store based on CockroachDB.

  Chuanpin Zhu and Debalin Das — DoorDash

On March 14th, Reddit engineers upgraded a Kubernetes cluster from 1.23 to 1.24, and all hell broke loose. I admire their precision in being down for 100π minutes.

  Jayme Howard — Reddit

With a huge user-base of students and teachers, these folks upped their incident response game, and they share how.

  Nadinastiti and Estu Fardani — GovTech Edu

A lurking bug in redis-py allowed users to see one another’s data, and OpenAI took ChatGPT down to limit the damage.


In Linux, source port allocation can be complex. This article shows why with a ton of code and tracing examples.

  Jakub Sitnicki — Cloudflare

The gap between “paying for peak” and “earning on average” is critical to understand how the economics of large-scale cloud systems differ from traditional single-tenant systems.

  Marc Brooker

A configuration error was masked because the app automatically fell back to the original configuration. The problem only surfaced when the service was redeployed.


Updated: March 26, 2023 — 11:21 pm
A production of Tinker Tinker Tinker, LLC