TL;DR: Prefer investing in recovery instead of prevention.
Make failure a non-event, rather than trying to prevent it. You won’t succeed in fully preventing failures, and you’ll fall out of practice at recovering from them.
They had me at “normalization of deviance”. I’ll read pretty much anything with that in the title.
Tim Davies — Fast Jet Performance
Monzo’s system integrates directly with Slack, helping you manage incidents and track what happens. Check out their video presentation for more details.
Me too! Great thread.
Nolan Caudill and others
I love Honeycomb incident reviews, I really do.
Born from a Twitter argument thread, this article goes into depth about why Friday change freezes can do much more harm than good.
- Amazon EC2
- Network-related issues in Japan and Hong Kong (on separate days). It’s starting to become downright impossible to find historical incidents on their mile-long status page.
- Google Hangouts Meet
- Google Cloud Console
- Azure, Microsoft 365, and Dynamics 365
- A DNS change went awry, resulting in one of their DNS zone’s four nameservers having an empty copy of the zone and serving NXDOMAIN. This is a really interesting incident report to read. Had the nameserver simply not had the zone at all, it would have returned a non-authoritative answer, and clients would have fallen back to one of the other three nameservers.
- Wells Fargo (bank)
- Google AdSense
- Facebook, Instagram, and WhatsApp
- Coles (Supermarket chain)
- Halifax and Lloyds (banks)
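
The Azure DNS incident above hinges on a subtlety of resolver behavior: an NXDOMAIN is an authoritative answer that resolvers accept as final, while a server that doesn’t have the zone at all returns a non-authoritative failure that triggers fallback to the next nameserver. Here’s a minimal sketch of that fallback logic (a toy model, not a real DNS client; the server functions and response codes are simplified stand-ins):

```python
# Toy model of stub-resolver fallback. Each "nameserver" is a function
# returning (rcode, answer). Response codes are simplified stand-ins.

NXDOMAIN = "NXDOMAIN"  # authoritative "name does not exist" -- accepted as final
REFUSED = "REFUSED"    # server won't answer for this zone -- try the next server
NOERROR = "NOERROR"    # successful answer

def resolve(nameservers, name):
    """Query nameservers in order. Authoritative answers (NOERROR or
    NXDOMAIN) end resolution; non-authoritative failures fall through
    to the next server."""
    for ns in nameservers:
        rcode, answer = ns(name)
        if rcode in (NOERROR, NXDOMAIN):
            return rcode, answer  # authoritative result: stop here
        # REFUSED (or SERVFAIL): fall back to the next nameserver
    return "SERVFAIL", None

# A server with an *empty* copy of the zone answers NXDOMAIN authoritatively:
empty_zone = lambda name: (NXDOMAIN, None)
# A healthy server answers with a record:
healthy = lambda name: (NOERROR, "203.0.113.10")
# A server *missing* the zone entirely refuses the query:
missing_zone = lambda name: (REFUSED, None)

# If the broken server answers first, its NXDOMAIN is accepted and
# resolution fails even though three healthy servers remain:
print(resolve([empty_zone, healthy, healthy, healthy], "example.com"))
# -> ('NXDOMAIN', None)

# Whereas a server that lacks the zone triggers fallback, and the
# client still gets an answer:
print(resolve([missing_zone, healthy, healthy, healthy], "example.com"))
# -> ('NOERROR', '203.0.113.10')
```

That asymmetry is why an empty-but-present zone is so much worse than a missing one: the failure mode defeats the redundancy of having four nameservers.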