I’m back! Kidney donation was a fascinating and rewarding experience, and I encourage you to learn more. It’s amazing how it’s possible to fix one human with spare parts from another!
I’ll share more about my experience later, but for now: thank you to the many of you who reached out with well-wishes. I’m feeling great and recovering nicely. I used the National Kidney Registry’s Voucher Program, allowing me to donate my kidney now and complete my healing while the NKR works to find a blood-type matched kidney for my intended recipient. It’s an incredible system.
I’m slowly catching up on the many SRE-related articles posted during my hiatus. If you’ve sent me links, thank you so much, and please understand that I’m woefully behind on my inbox, but I’ll review your suggestions soon!
Human error? Perhaps, but there were multiple compounding factors in this airplane incident, including sleep debt, circadian rhythms, an inoperative thrust reverser, and normalization of deviance.
David Kaminski-Morrow — FlightGlobal
This is a technical report on three bugs that intermittently degraded responses from Claude. Below we explain what happened, why it took time to fix, and what we’re changing.
I especially like the section, “Why detection was difficult”.
Anthropic
While I was out, I definitely heard about the big AWS us-east-1 outage! Here’s Amazon’s write-up of the incident, involving a latent race condition.
Amazon
I really love this analysis of the AWS us-east-1 outage. It’s Lorin’s Law once again: an infrastructure feature designed to improve reliability is implicated in an incident.
Lorin Hochstein
Ouch! We should exercise caution when ascribing actions like “lying” and “covering tracks” to LLM-based agents — and of course when giving such agents deep access to modify our systems.
Bruce Gil — Gizmodo
This post delves into our journey to transform incident management from a centralized function into a widespread, accessible practice and the hard-won lessons we’ve learned along the way.
They built a paved path based on Incident.io that any of their teams could use to manage an incident.
Molly Struve — Netflix
If someone did something wrong, then it’s vital to understand why they did it.
My favorite part of this article is the list of common reasons people violate procedures.
NorthStandard
A detailed description of Cloudflare’s new R2 SQL service that provides serverless querying across data in their object store service. This article helped me understand things I hadn’t really grasped before about how columnar datastores work.
Yevgen Safronov, Nikita Lapkov, Jérôme Schneider — Cloudflare
