This article recommends a balanced approach to changing existing practices as we learn advanced resilience engineering concepts.
I can confidently say that when an executive leader wants to be talking about quality of service for your customers, the last thing they want to hear about is academic papers and Monte Carlo simulations.
Michelle Casey — Resilience in Software Foundation
I know you probably know all about how hashing works, but this one’s still worth a read. The article includes interactive demonstrations and clearly presents the concepts behind how hash function performance is evaluated.
Sam Rose
Pulled from the Internet Archive, here’s a story of how the now-defunct Parse rewrote their Ruby on Rails API in Golang, significantly improving reliability.
Charity Majors
We are sharing methodologies we deploy at various scales for detecting SDC [Silent Data Corruption] across our AI and non-AI infrastructure to help ensure the reliability of AI training and inference workloads across Meta.
Harish Dattatraya Dixit and Sriram Sankar — Meta
As monday.com broke up their monolith into microservices, their number of databases grew too. To have a chance of managing all of them, they shifted from traditional DBA practices to database reliability engineering (DBRE).
Mateusz Wojciechowski — monday.com
Airbnb runs a large-scale database on Kubernetes. They have various techniques to deal with the ephemerality of pods and the risks inherent in cluster upgrades.
Artem Danilov — Airbnb
The author of this article brings us along as they do a very thorough evaluation of K8sGPT, showing us what it can do and some ways in which it can fall short.
Evgeny Torin — Palark
What is good incident communication? This article draws on theory from Herbert Clark’s Joint Action Ladder to help us evaluate and strengthen communication.
Stuart Rimell — Uptime Labs