TL;DR: Prefer investing in recovery instead of prevention.
Make failure a non-event, rather than trying to prevent it. You won’t succeed in fully preventing failures, and you’ll fall out of practice at recovering from them.
They had me at “normalization of deviance”. I’ll read pretty much anything with that in the title.
Tim Davies — Fast Jet Performance
Monzo’s system integrates directly with Slack, helping you manage incidents and track what happens. Check out their video presentation for more details.
Me too! Great thread.
Nolan Caudill and others
I love Honeycomb incident reviews, I really do.
Born from a Twitter argument thread, this article goes into depth about why Friday change freezes can do much more harm than good.
- Amazon EC2
- Network-related issues in Japan and Hong Kong (on separate days). It’s starting to become downright impossible to find historical incidents on their mile-long status page.
- Google Hangouts Meet
- Google Cloud Console
- Azure, Microsoft 365, and Dynamics 365
- A DNS change went awry, resulting in one of their DNS zone’s four nameservers having an empty copy of the zone and serving NXDOMAIN. This is a really interesting incident report to read. Had the nameserver simply not had the zone at all, it would have returned a non-authoritative answer, and clients would have fallen back to one of the other three nameservers.
- Wells Fargo (bank)
- Google AdSense
- Facebook, Instagram, and WhatsApp
- Coles (Supermarket chain)
- Halifax and Lloyds (banks)
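
The Azure DNS incident above hinges on a subtlety of resolver behavior: an NXDOMAIN is an authoritative answer that resolvers accept as final, while a server that doesn’t have the zone at all returns a non-authoritative failure that triggers fallback to the next nameserver. Here’s a minimal sketch of that fallback logic (a toy model, not a real DNS client; the server functions and response codes are simplified stand-ins):

```python
# Toy model of stub-resolver fallback. Each "nameserver" is a function
# returning (rcode, answer). Response codes are simplified stand-ins.

NXDOMAIN = "NXDOMAIN"  # authoritative "name does not exist" -- accepted as final
REFUSED = "REFUSED"    # server won't answer for this zone -- try the next server
NOERROR = "NOERROR"    # successful answer

def resolve(nameservers, name):
    """Query nameservers in order. Authoritative answers (NOERROR or
    NXDOMAIN) end resolution; non-authoritative failures fall through
    to the next server."""
    for ns in nameservers:
        rcode, answer = ns(name)
        if rcode in (NOERROR, NXDOMAIN):
            return rcode, answer  # authoritative result: stop here
        # REFUSED (or SERVFAIL): fall back to the next nameserver
    return "SERVFAIL", None

# A server with an *empty* copy of the zone answers NXDOMAIN authoritatively:
empty_zone = lambda name: (NXDOMAIN, None)
# A healthy server answers with a record:
healthy = lambda name: (NOERROR, "203.0.113.10")
# A server *missing* the zone entirely refuses the query:
missing_zone = lambda name: (REFUSED, None)

# If the broken server answers first, its NXDOMAIN is accepted and
# resolution fails even though three healthy servers remain:
print(resolve([empty_zone, healthy, healthy, healthy], "example.com"))
# -> ('NXDOMAIN', None)

# Whereas a server that lacks the zone triggers fallback, and the
# client still gets an answer:
print(resolve([missing_zone, healthy, healthy, healthy], "example.com"))
# -> ('NOERROR', '203.0.113.10')
```

That asymmetry is why an empty-but-present zone is so much worse than a missing one: the failure mode defeats the redundancy of having four nameservers.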