Hey folks, no issue this week as I recover from unexpected eye surgery. Detached retina is serious business. If you see flashing or floaters in your vision, please get it checked out!
SRE Weekly Issue #172
Articles
An experienced pilot and programmer details the background behind the 737 MAX’s MCAS system and discusses the risks and motivations involved.
Boeing’s solution to its hardware problem was software.
Thanks to John Goerzen for this one.
Gregory Travis — IEEE Spectrum
A detailed analysis of a paper by Eric Hollnagel and David Woods on designing systems that include humans and computers.
The operator detects failures better when he participates in system control as opposed to functioning only as a monitor…
Thai Wood (summary)
An essay on the difference in philosophies between Safety I and Safety II and on understanding how our systems succeed rather than focusing on how they fail.
Ryan Frantz
Azure’s Project Tardigrade is exploring interesting ideas like keeping VMs resident in memory even when the host kernel reboots. This reminds me of another similarly-named project.
Chris Kanaracus — TechTarget
This is a followup to an article from last week about a Honeycomb incident, going into more detail on what went wrong and how they figured it out using Honeycomb itself.
Douglas Soo — Honeycomb
On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement.
[…]
This blog post describes Grab’s post-mortem findings for the outage caused by the Redis Cluster failure.
Michael Cartmell, Jiahao Huang, and Sandeep Kumar — Grab
I like how their chosen solution fetches from all the datacenters in the normal case, so they don’t experience a sudden shift in traffic pattern during a failover.
Preslav Le — Dropbox
Outages
- GitHub
- Gmail
- Ankle Bracelets in the Netherlands
- These are the ankle bracelets used to monitor and enforce house arrest.
the Dutch Ministry of Justice and Security had to step in and preemptively arrest and jail some of its most high-risk suspects
- Facebook and Instagram
SRE Weekly Issue #171
Articles
TL;DR: Prefer investing in recovery instead of prevention.
Make failure a non-event, rather than trying to prevent it. You won’t succeed in fully preventing failures, and you’ll instead get out of practice at recovering.
Aaron Blohowiak
They had me at “normalization of deviance”. I’ll read pretty much anything with that in the title.
Tim Davies — Fast Jet Performance
Monzo’s system is directly integrated with Slack, helping you manage your incident and track what happens. Check out their video presentation for more details.
Monzo
Me too! Great thread.
Nolan Caudill and others
I love Honeycomb incident reviews, I really do.
Douglas Soo
Born from a Twitter argument thread, this article goes into depth about why Friday change freezes can cause much more trouble than good.
Charity Majors
Outages
- Amazon EC2
- Network-related issues in Japan and Hong Kong (on separate days). It’s starting to become downright impossible to find historical incidents on their mile-long status page.
- Google Hangouts Meet
- Google Cloud Console
- Slack
- Azure, Microsoft 365, and Dynamics 365
- A DNS change went awry, resulting in one of their DNS zone’s four nameservers having an empty copy of the zone and serving NXDOMAIN. This is a really interesting incident report to read. Had the nameserver simply not had the zone at all, it would have returned a non-authoritative answer, and clients would have fallen back to one of the other three nameservers.
- Wells Fargo (bank)
- Discord
- Google AdSense
- Facebook, Instagram, and WhatsApp
- Coles (Supermarket chain)
- Halifax and Lloyds (banks)
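
The Azure DNS incident above turns on how a resolver treats different kinds of responses, so here is a minimal sketch of that behavior. It is a toy model, not real resolver code: the `resolve` function, the three stand-in nameserver functions, and the example address are all hypothetical, and modeling "does not have the zone" as a `REFUSED` response is an assumption (real servers may answer differently, for example with a referral).

```python
from enum import Enum


class Rcode(Enum):
    NOERROR = "NOERROR"
    NXDOMAIN = "NXDOMAIN"
    REFUSED = "REFUSED"
    SERVFAIL = "SERVFAIL"


def resolve(name, nameservers):
    """Try each nameserver in order.

    Returns (rcode, answer, servers_tried). NOERROR and NXDOMAIN are
    treated as final answers; anything else means "try the next one".
    """
    tried = 0
    for server in nameservers:
        tried += 1
        rcode, answer = server(name)
        if rcode in (Rcode.NOERROR, Rcode.NXDOMAIN):
            # An authoritative "this name does not exist" is as final
            # as a successful answer, so the resolver stops here.
            return rcode, answer, tried
        # REFUSED or SERVFAIL: this server cannot help, so fall through.
    return Rcode.SERVFAIL, None, tried


# Hypothetical stand-ins for the zone's nameservers.
def healthy(name):
    return Rcode.NOERROR, "203.0.113.10"  # documentation-range address


def empty_zone(name):
    # Has the zone, but with no records in it (the incident scenario).
    return Rcode.NXDOMAIN, None


def zone_missing(name):
    # Assumption: a server without the zone declines to answer.
    return Rcode.REFUSED, None


incident = resolve("db.example.com", [empty_zone, healthy, healthy, healthy])
counterfactual = resolve("db.example.com", [zone_missing, healthy, healthy, healthy])
```

In this model the empty zone stops resolution at the first server with `NXDOMAIN`, while the missing zone costs one extra query and then succeeds, which is the asymmetry the incident report describes.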
SRE Weekly Issue #170
Articles
This myth is a misguided belief that engineers are like Laplace’s Demon; they maintain an accurate mental model of the system, foresee all the consequences of their actions, predict where the business is going, and are careful enough to avoid mistakes.
Aaron Blohowiak — Netflix
I highly recommend watching some of the talks or at least perusing the slides.
The concern is that incidents have been investigated by parties that were involved in or related to them, which raises the risk of conflicts of interest. In a small company, avoiding this kind of thing may not be possible, but we should at least keep the risks in mind.
Patrick Kingsland — Railway Technology
An absolute treasure trove of links to many articles and papers on resilience engineering. Beyond just links, there are short profiles of 30+ important thinkers in the field. I’m going to be busy for a while.
@lorin (GitHub)
This is about project retrospectives, but it applies equally well to incident retrospectives.
Dominika Bula — Red Hat
Here’s a counterpoint to an article I linked to last week.
Karl Bode — Motherboard
Outages
SRE Weekly Issue #169
Articles
My coworker pointed me toward this article, and we had a really great conversation. I shared this article that I’d linked previously here, and it hit me: Boeing (and the FAA?) assumed MCAS was fine because a failure in it would look like a normal kind of failure with an established recovery procedure.
The problem is, we’ve seen that the recovery procedure can fail if the plane is moving so fast toward the ground that the pilots can’t physically pull it out of a dive. And it seems possible that no one knew that the recovery mechanism had this fatal vulnerability. This has all the hallmarks of a classic complex failure.
Thanks to John Goerzen for this one.
Richard McSpadden — AOPA
Pretty much any thread by Colm MacCárthaigh is a great read.
I think right around this minute is just about exactly 5 years since the Heartbleed vulnerability in OpenSSL became public. I remember the day vividly, and if you’re interested, allow me to tell you about how the day, and the subsequent months, and years unfolded …
Colm MacCárthaigh
Find out why going on call made sense for a Developer Advocate and how it went.
Liz Fong-Jones — Honeycomb
As the BGP route table grows, some devices will soon run out of space to store it all.
Catalin Cimpanu
The risk of logical damage to the data in a DB is the kind of risk that means there’s no such thing as a true rollback (You Can’t Have a Rollback Button).
Benji Weber
Our field is evolving toward adopting resilience engineering, and it’s not an easy process. This post goes into some detail on the mental struggle and points in the direction we need to go to get there.
Will Gallego [Note: Will is my coworker]
Outages
- Gmail Suffers Two-Hour Global Outage: Reports 04/18/2019
- Google OAuth
- Seems like this may have effectively taken down Gmail.
- Grindr
- 1&1 Ionos