SRE Weekly Issue #171

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.

http://try.victorops.com/sreweekly/continuous-improvement-webinar

Articles

TL;DR: Prefer investing in recovery over prevention.

Make failure a non-event rather than trying to prevent it. You won’t fully succeed in preventing failures, and you’ll get out of practice at recovering from them.

Aaron Blohowiak

They had me at “normalization of deviance”. I’ll read pretty much anything with that in the title.

Tim Davies — Fast Jet Performance

Monzo’s system is directly integrated with Slack, helping you manage incidents and track what happens. Check out their video presentation for more details.

Monzo

Me too! Great thread.

Nolan Caudill and others

I love Honeycomb incident reviews, I really do.

Douglas Soo

Born from a Twitter argument thread, this article goes into depth about why Friday change freezes can do much more harm than good.

Charity Majors

Outages

SRE Weekly Issue #170

A message from our sponsor, VictorOps:

Our latest list of the top 12 server monitoring tools can help your SRE team get started in building a comprehensive monitoring strategy. Drive deeper service reliability through effective server monitoring:

http://try.victorops.com/sreweekly/top-server-monitoring-software

Articles

This myth is a misguided belief that engineers are like Laplace’s Demon; they maintain an accurate mental model of the system, foresee all the consequences of their actions, predict where the business is going, and are careful enough to avoid mistakes.

Aaron Blohowiak — Netflix

I highly recommend watching some of the talks or at least perusing slides.

The concern is that incidents have been investigated by parties that were involved in or related to the incident, raising the possibility of conflicts of interest. In a small company, avoiding this kind of thing may not be possible, but we should at least keep the risks in mind.

Patrick Kingsland — Railway Technology

An absolute treasure trove of links to many articles and papers on resilience engineering. Beyond just links, there are short profiles of 30+ important thinkers in the field. I’m going to be busy for a while.

@lorin (GitHub)

This is about project retrospectives, but it applies equally well to incident retrospectives.

Dominika Bula — Red Hat

Here’s a counterpoint to an article I linked to last week.

Karl Bode — Motherboard

Outages

SRE Weekly Issue #169

A message from our sponsor, VictorOps:

[Last Chance] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

My coworker pointed me toward this article, and we had a really great conversation. I shared this article that I’d linked previously here, and it hit me: Boeing (and the FAA?) assumed MCAS was fine because a failure in it would look like a normal kind of failure with an established recovery procedure.

The problem is, we’ve seen that the recovery procedure can fail if the plane is moving so fast toward the ground that the pilots can’t physically pull it out of a dive. And it seems possible that no one knew that the recovery mechanism had this fatal vulnerability. This has all the hallmarks of a classic complex failure.

Thanks to John Goerzen for this one.

Richard McSpadden — AOPA

Pretty much any thread by Colm MacCárthaigh is a great read.

I think right around this minute is just about exactly 5 years since the Heartbleed vulnerability in OpenSSL became public. I remember the day vividly, and if you’re interested, allow me to tell you about how the day, and the subsequent months, and years unfolded …

Colm MacCárthaigh

Find out why going on call made sense for a Developer Advocate and how it went.

Liz Fong-Jones — Honeycomb

As the BGP route table grows, some devices will soon run out of space to store it all.

Catalin Cimpanu

The risk of logical damage to the data in a DB is the kind of risk that means there’s no such thing as a true rollback (You Can’t Have a Rollback Button).

Benji Weber

Our field is evolving toward adopting resilience engineering, and it’s not an easy process. This post goes into some detail on the mental struggle and points in the direction we need to go to get there.

Will Gallego [Note: Will is my coworker]

Outages

SRE Weekly Issue #168

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

This one’s great for folks who are new to SRE, and it’s also an enlightening read for seasoned SREs. What caught me most was the Definition section, on what it means to be an SRE.

Alice Goldfuss

In this article adaptation of a conference talk, the author lays out 8 common pitfalls in chaos engineering, with detailed example stories for each. It goes much deeper than mere chaos engineering into the theory of how to operate complex systems.

Nora Jones

Automation can have unintended effects, and it often fails to have the effect we hope it will.

Thanks to Greg Burek for this one.

Courtney Nash

Having recently binge-watched Air Emergency, I felt that SREs can learn many things from the aviation industry.

Anshul Patel

Lessons learned by a software engineer on supporting their code in production.

Kashyap Kondamudi

CASE stands for Context-heavy, Actionable, Symptom-based, and Evaluated. That last one’s really key: the author proposes setting an expiration time for your alerts, after which you should evaluate them to make sure they still make sense.

Cory Watson
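As a toy illustration of the “Evaluated” idea, here’s a minimal Python sketch. The field names, review interval, and alert names are my own invention, not from the talk; the point is just that each alert definition carries a review-by date and anything past that date gets flagged for re-evaluation.

```python
from datetime import date

def alerts_needing_review(alerts, today):
    """Return the names of alert definitions whose review-by date has passed.

    `alerts` is a list of dicts with hypothetical keys "name" and "review_by";
    this schema is illustrative, not from the talk.
    """
    return [a["name"] for a in alerts if a["review_by"] <= today]

# Two made-up alert definitions with expiration dates attached.
alerts = [
    {"name": "disk_full", "review_by": date(2019, 1, 15)},
    {"name": "latency_slo_burn", "review_by": date(2019, 12, 1)},
]

# As of mid-April 2019, only "disk_full" is past its review date
# and should be re-evaluated before anyone trusts it again.
stale = alerts_needing_review(alerts, today=date(2019, 4, 14))
```

In practice the review date would live alongside the alert definition in your monitoring config, so stale alerts surface during routine review rather than at 3 a.m.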

Outages

  • Heroku: (EU) routing issues for ssl:endpoint applications
    • Heroku posted this followup for an outage on April 2.
  • The Travis CI Blog: Incident review for slow booting Linux builds outage
    • The outage happened March 27-28.
  • Azure VMs — North Central US
    • Since deep-linking to Azure incident summaries doesn’t work and this one is especially interesting, I’ll quote it here:

      Azure Storage team made a configuration change on 9 April 2019 at 21:30 UTC to our back-end infrastructure in North Central US to improve performance and latency consistency for Azure Disks running inside Azure Virtual Machines. This change was designed to be transparent to customers. It was enabled following our normal deployment process, first to our test environment, and lower impact scale units before being rolled out to the North Central US region. However, this region hit bugs which impacted customer VM availability. Due to a bug, VM hosts were able to establish a session with the storage scale unit but hit issues when trying to receive/send data from/to the storage scale unit. This situation was designed to be handled with fallback to our existing data path, but an additional bug led to failure in the fallback path and resulted in VM reboots.

  • Facebook, Instagram, and WhatsApp

SRE Weekly Issue #167

A message from our sponsor, VictorOps:

[You’re Invited] Death to Downtime: How to Quantify and Mitigate the True Costs of Downtime. VictorOps and Catchpoint are teaming up for a live webinar on 5 monitoring and incident response best practices for preventing outages.

http://try.victorops.com/sreweekly/death-to-downtime-webinar

Articles

This is an awesome write-up of SRECon, but the part I really love is the intro. The author gives voice to a growing tension I’ve seen in our field, as we try to adopt the tenets of Safety II which can seem to be at odds with traditional SRE practices. There’s a lot here that we SREs need to work out as our profession matures, and I’m really enjoying the process.

Tanya Reilly

Experts recommend trying to keep the concepts of blame, root cause, and hindsight bias out of our retrospective investigations. This insightful article explains that they all stem from the illusion that we are in full control of our systems.

Thanks to Will Gallego for this one.

Ryan Frantz

Here’s a top-notch followup analysis from Mailchimp on the Mandrill outage last month. Their PostgreSQL DB ran out of transaction IDs (a common failure mode), causing a painful outage. Tons of great stuff here, including a mention of rotating ICs every 3 hours to prevent exhaustion and allow them to sleep.

Mailchimp
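For context on that failure mode: PostgreSQL transaction IDs are 32-bit, which leaves roughly 2³¹ (about 2.1 billion) XIDs of headroom before wraparound, and `age(datfrozenxid)` reports how much of it a database has consumed. Here’s a minimal Python sketch of the headroom arithmetic; the alert threshold is my own illustrative choice, not from Mailchimp’s writeup.

```python
# PostgreSQL XIDs are 32-bit; roughly 2**31 transaction IDs of headroom
# exist before wraparound protection forces the database read-only.
XID_WRAPAROUND_LIMIT = 2**31  # ~2.1 billion

def xid_headroom(datfrozenxid_age: int) -> int:
    """Remaining transaction IDs before wraparound, given age(datfrozenxid)."""
    return XID_WRAPAROUND_LIMIT - datfrozenxid_age

def needs_urgent_vacuum(datfrozenxid_age: int,
                        threshold: int = 500_000_000) -> bool:
    """Alert when fewer than `threshold` XIDs remain (illustrative threshold)."""
    return xid_headroom(datfrozenxid_age) < threshold

# A database whose oldest unfrozen XID is 1.9 billion transactions old
# has only ~250 million XIDs left and needs an aggressive VACUUM now.
```

A real monitor would feed this from `SELECT datname, age(datfrozenxid) FROM pg_database`; the lesson from the outage is to alert on this long before autovacuum’s emergency behavior kicks in.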

And here’s where things get really interesting. Incidents are never as simple as they seem from the outside, and the 737 MAX situation is no exception. I anxiously await the full report, in which we’ll hear more about the confluence of contributing factors that must have been involved here.

Thom Patterson — CNN

There’s a lot in this, and I don’t feel comfortable summarizing it with a little blurb about lessons learned. Chilling though it is, I’m glad I read it.

Thanks to Sri Ray for this one.

Patrick Smith — The Telegraph

I consider a system to be production-ready when it has not just error handling inside a particular component, but actual dedicated components for failure handling (note the difference from error handling), management of failures, and their mitigation.

Ayende Rahien

Outages

A production of Tinker Tinker Tinker, LLC