A thoughtful framework for evaluating the risk of using AI coding tools, centered on the probability, detectability, and impact of errors.
Birgitta Böckeler — martinfowler.com
Cloudflare does some really fascinating things with networking. Here’s a deep dive on how they solved a problem in their implementation of sharing IP addresses across machines.
Chris Branch — Cloudflare
I especially like how they nail down what exactly counts as “zero downtime” in the migration. They did allow some kinds of degradation.
Anna Dowling — Tines
We’re always making tradeoffs in our systems (and companies). Incidents can help us see whether we’re making the right ones and how our decisions have played out.
Fred Hebert
Fixation on a plan, on a model of the system, or on a theory of the cause is a major risk in incident response.
Lorin Hochstein
How do you design a system with events that have different SLO requirements?
They added a proxy layer on the consumer side to allow parallel processing within partitions, to avoid head-of-line blocking.
Rohit Pathak, Tanya Fesenko, Collin Crowell, and Dmitry Mamyrin — Klaviyo
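To make the head-of-line blocking idea concrete, here is a minimal sketch of processing one partition's messages in parallel while still committing only a contiguous run of offsets. It is not Klaviyo's implementation: the function name `process_partition_batch`, the `(offset, payload)` message shape, and the thread-pool approach are all assumptions for illustration, and it gives up in-partition ordering in exchange for throughput.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_partition_batch(messages, handler, max_workers=4):
    """Run handler on every (offset, payload) pair of one partition concurrently.

    Hypothetical helper: returns the highest offset such that it and every
    earlier offset in the batch succeeded, or None if the first one failed.
    Only that offset is safe to commit back to the broker.
    """
    done = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # A slow message occupies one worker; the others keep draining the batch.
        futures = {pool.submit(handler, payload): offset
                   for offset, payload in messages}
        for future in as_completed(futures):
            if future.exception() is None:
                done.add(futures[future])

    committable = None
    for offset in sorted(offset for offset, _ in messages):
        if offset not in done:
            break  # a gap: later offsets finished but cannot be committed yet
        committable = offset
    return committable
```

A slow message no longer delays the ones behind it, but a failed message still caps what can be committed, so a retry or dead-letter path is needed for the gap.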
A database schema change was unintentionally reverted, and a subsequent thundering herd exacerbated the impact.
Ray Chen — Railway
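For readers unfamiliar with the term, a thundering herd happens when many clients retry or reconnect at the same moment. One common, general mitigation is exponential backoff with jitter, sketched below. This is not taken from Railway's write-up; the function name `backoff_delay` and its default values are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter exponential backoff (hypothetical helper).

    Returns a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed together
    spread their retries out instead of retrying in lockstep.
    """
    return rng.uniform(0, min(cap, base * 2 ** attempt))
```

A client would sleep for `backoff_delay(attempt)` seconds before retry number `attempt`, starting from 0.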
Recently, we had to upgrade a heavily loaded PostgreSQL cluster from version 13 to 16 while keeping downtime minimal. The cluster, consisting of a master and a replica, was handling over 20,000 transactions per second.
Timur Nizamutdinov — Palark
