Sometimes, we can harness randomness to improve throughput and reliability.
Teiva Harsanyi — The Coder Cafe
Not just the “how” but also the “why”, plus the challenges they ran into along the way.
Daniel Paulus and Umut Uzgur — Checkly
It’s a classic problem: how do you detect problems that badly impact a specific set of customers, when the overall percentage affected is tiny?
Lakshmi Narayan and Joshua Delman — Stripe
This is the clearest and most concise explanation of the Byzantine Generals Problem that I’ve read.
Sid — The Scalable Thread
Th[is] article describes some different methods and tools that engineers can use to simulate their clusters and what knowledge they can gain from it, and it presents a case study using SimKube, the Kubernetes simulator developed by Applied Computing Research Labs in 2024.
David R. Morrison — ACM Queue
An IaC (infrastructure as code) nightmare: when a list went from containing IPs to being empty, the IP block rule was suddenly interpreted as “block everything” rather than “block nothing”.
Jake Cooper — Railway
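To make the failure mode concrete, here is a minimal sketch of how an empty list can flip a rule's meaning. It is purely illustrative and is not Railway's code: the function names and the "empty means match all" shortcut are assumptions chosen to reproduce this class of bug.

```python
from ipaddress import ip_address, ip_network


def is_blocked_buggy(client_ip, blocked_cidrs):
    """Hypothetical evaluator that treats an empty list as 'no filter given'."""
    # Bug: with no CIDRs configured, the rule falls through to "match everyone".
    if not blocked_cidrs:
        return True
    addr = ip_address(client_ip)
    return any(addr in ip_network(cidr) for cidr in blocked_cidrs)


def is_blocked_safe(client_ip, blocked_cidrs):
    """Hypothetical evaluator where an empty list blocks nothing."""
    addr = ip_address(client_ip)
    # any() over an empty iterable is False, so an empty list blocks no one.
    return any(addr in ip_network(cidr) for cidr in blocked_cidrs)
```

With a populated list the two functions agree; they only diverge once the list becomes empty, which is why this kind of bug can sit unnoticed until someone removes the last entry.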
The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2.
Matt Silverlock and Javier Castro — Cloudflare
Along with being blatantly illegal, DOGE’s actions are incredibly risky from a reliability perspective. Thanks, Liz, for putting into words concerns that I also share.
Liz Fong-Jones — Bulletin of the Atomic Scientists