A thoughtful evaluation of current trends in AI through the lens of Lisanne Bainbridge’s classic paper, The Ironies of Automation. I really got a lot out of this one.
Uwe Friedrichsen
Netflix supercharged its workflow engine by rewriting it. I like the way the authors explain why they settled on a full rewrite and which alternatives they considered.
Jun He, Yingyi Zhang, and Ely Spears — Netflix
This one goes deep on how to build a reliable service on unreliable parts. Can retries improve your overall reliability? What about the reliability of the retry system itself?
Warren Parad — Authress
In this article, we’ll explore how cold-restart dependencies form, why typical recovery designs break down, and what architectural principles can help systems warm up faster after a complete outage.
Bala Kambala
This one goes into the qualities of a good post-incident review, the definition of resilience, and a discussion of blamelessness, drawing lessons from aviation.
Gamunu Balagalla — Uptime Labs
It would be easy to blame the poor outcome of BOAC 712’s engine failure on human error, since the pilots missed key steps in their checklists. Instead, the accident investigators cited systemic issues, resulting in improvements in checklists and other areas.
Mentour Pilot
Cloudflare had another significant outage, though not as big as the one last month. This one was related to steps they took to mitigate the big React RCE vulnerability.
Dane Knecht — Cloudflare
Lorin’s whole analysis is awesome, but there’s an especially incisive section at the end that uses math to put Cloudflare’s run of two recent big incidents in perspective.
Lorin Hochstein
