SRE Weekly Issue #497

A thoughtful framework for evaluating the risk in using AI coding tools, centering around the probability, detectability, and impact of errors.

Birgitta Böckeler — martinfowler.com

So long, and thanks for all the fish: how to escape the Linux networking stack

Cloudflare does some really fascinating things with networking. Here’s a deep dive on how they solved a problem in their implementation of sharing IP addresses across machines.

Chris Branch — Cloudflare

Zero downtime database migrations: Lessons from moving a live production database

I especially like how they nail down what exactly counts as “zero downtime” in the migration. They did allow some kinds of degradation.

Anna Dowling — Tines

Ongoing Tradeoffs, and Incidents as Landmarks

We’re always making tradeoffs in our systems (and companies). Incidents can help us see whether we’re making the right ones and how our decisions have played out.

Fred Hebert

Fixation: the ever-present risk during incident handling

Fixation on a plan, on a model of the system, or on a theory of the cause, is a major risk in incident response.

Lorin Hochstein

Building a Distributed Priority Queue in Kafka

How do you design a system with events that have different SLO requirements?

They added a proxy layer on the consumer side to allow parallel processing within partitions, to avoid head-of-line blocking.

Rohit Pathak, Tanya Fesenko, Collin Crowell, and Dmitry Mamyrin — Klaviyo

Incident Report: September 22nd, 2025

A database schema change was unintentionally reverted, and a subsequent thundering herd exacerbated the impact.

Ray Chen — Railway

Upgrading PostgreSQL with no data loss and minimal downtime

Recently, we had to upgrade a heavily loaded PostgreSQL cluster from version 13 to 16 while keeping downtime minimal. The cluster, consisting of a master and a replica, was handling over 20,000 transactions per second.

Timur Nizamutdinov — Palark
