The SRE book has a chapter covering on-call, but it’s best suited for huge-scale companies. What should the rest of us do?
If you’re feeling hesitant about chaos engineering, or you’re trying to convince someone who is, this might be useful. The myths are:
Myth #1: Chaos engineering is testing in production
Myth #2: Chaos engineering is about randomly breaking things
Myth #3: Chaos engineering is only for large, modern distributed systems
Myth #4: We don’t need more chaos – we already have plenty!
Myth #5: Chaos engineering is only for very mature teams/products
Drawing parallels to the high modernism movement during the cold war, this article raises interesting questions about the direction SRE is going, and system administration in general.
Laura Nolan — USENIX
Riffing off of a tweet by Charity Majors, this article explores the idea that moving faster can actually be safer, despite an urge one may feel to slow down.
An extreme oversimplification of this incident would be: multiple engine failure on a plane subsequent to a maintenance error on all engines. This accident is cited as a reason to have separate mechanics work on each engine, in hopes of avoiding duplicated errors.
US National Transportation Safety Board (multiple authors)
[…] in order to ship new features and improvements faster while lowering the risk in our deployments, we have a simple but powerful tool: feature flags.
Alberto Gimeno — GitHub
This one blew my mind. By recording instruction execution traces in a ring buffer, they’re able to reconstruct enough information to step through the execution leading up to a crash — even though they weren’t running the application under a debugger!
Walter Erquinigo, David Carrillo-Cisneros, Alston Tang — Facebook
Automation is supposed to take some of the load off of the human operator, right? But in reality, humans need to build a mental model of what the automation is doing in order to use it safely and effectively.
Shem Malmquist — WIRED