Articles
You might wonder why I have given almost zero coverage to “AIOps” here, and why my coverage of “anomaly detection” has included heavy skepticism. The reason: I simply haven’t seen any proof that it works.
The FTC’s recent stance on AI sums up my position nicely. If you want your AIOps product covered here, don’t just tell me it works; prove to me that it works.
Michael Atleson — Federal Trade Commission
How? With a safe and repeatable procedure for database migrations involving double-writing.
Lisa Karlin Curtis — incident.io
Push to main on a new microservice repo and it deploys to production, spins up a Slack channel for alerts, invites the CODEOWNERS, creates an on-call rotation, and puts them in it. Wow!
Kiselev Ivan — Better Programming
A routing issue caused widespread packet loss with worldwide impact across many services.
This month’s report had a couple of fascinating incidents, especially the one about source code archive hashes.
Jakub Oleksy — GitHub
Folks from the New York Times used chaos engineering to prepare for the surge of traffic during the US presidential election. They share five guidelines for effective chaos engineering for big data systems.
Shane Murray — Monte Carlo
Here’s that LFI Conf recap I wanted!
Vanessa Huerta Granda — Jeli
Former Google folks published this guide to help recently laid-off Google SREs integrate with the way SRE is done in the rest of the tech world. There’s an interesting hint about Google’s on-call compensation that I’m going to have to look into.
Murali Suriar and Niall Murphy
A normally conscientious airline captain made a decision he otherwise would not have, likely owing to severe sleep deprivation.
Admiral Cloudberg