SRE Weekly Issue #169

A message from our sponsor, VictorOps:

[Last Chance] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.


My coworker pointed me toward this article, and we had a really great conversation about it. I shared this article, which I’d linked here previously, and it hit me: Boeing (and the FAA?) assumed MCAS was fine because a failure in it would look like a normal kind of failure with an established recovery procedure.

The problem is, we’ve seen that the recovery procedure can fail if the plane is moving so fast toward the ground that the pilots can’t physically pull it out of a dive. And it seems possible that no one knew that the recovery mechanism had this fatal vulnerability. This has all the hallmarks of a classic complex failure.

Thanks to John Goerzen for this one.

Richard McSpadden — AOPA

Pretty much any thread by Colm MacCárthaigh is a great read.

I think right around this minute is just about exactly 5 years since the Heartbleed vulnerability in OpenSSL became public. I remember the day vividly, and if you’re interested, allow me to tell you about how the day, and the subsequent months, and years unfolded …

Colm MacCárthaigh

Find out why going on call made sense for a Developer Advocate and how it went.

Liz Fong-Jones — Honeycomb

As the BGP route table grows, some devices will soon run out of space to store it all.

Catalin Cimpanu

The risk of logical damage to the data in a database means there’s no such thing as a true rollback (You Can’t Have a Rollback Button).

Benji Weber

Our field is evolving toward adopting resilience engineering, and it’s not an easy process. This post goes into some detail on the mental struggle and points in the direction we need to go to get there.

Will Gallego [Note: Will is my coworker]


Updated: April 21, 2019 — 9:06 pm