SRE Weekly Issue #10


This week’s issue is packed with really meaty articles, which is a nice departure from last week’s somewhat sparse issue.

Articles

So much of what modern medicine has learned about system failures applies directly to SRE, usually without any adaptation required. In this edition of The Morning Paper, Adrian Colyer gives us his take on an excellent short paper by an MD. My favorite quotes:

Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The Software Evolution & Architecture Lab of the University of Zurich is doing a study on modern continuous deployment practices. I’m really interested to see the results, especially where CD meets reliability, so if you have a moment, please hop on over and add your answers. Thanks to Gerald Schermann at UZH for reaching out to me for this.

I’ve been debating with myself whether or not to link to Emerson Network Power’s survey of datacenter outage costs and causes. The report itself is mostly just uninteresting numbers, and it’s behind a signup wall. However, this article is a good summary of the report and links to other interesting stats.

Facebook algorithmically generated hundreds of millions of custom-tailored video montages for its birthday celebration. How they did it without dedicating specific hardware to the task and without impacting production is a pretty interesting read.

Administering Elasticsearch can be just as complicated and demanding as administering MySQL. This article has an interesting description of SignalFx’s method for resharding without downtime.
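
The article goes into the details of SignalFx’s approach; as a rough illustration of the general idea, here is a minimal sketch of the common alias-swap reindexing pattern, using the Python elasticsearch client. The index and alias names are hypothetical, and this is not necessarily SignalFx’s exact method:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Create a new index with the desired shard layout.
es.indices.create(
    index="metrics-v2",
    body={"settings": {"number_of_shards": 8, "number_of_replicas": 1}},
)

# 2. Copy all documents from the old index into the new one.
#    Writes arriving during the copy need separate handling
#    (e.g. dual-writing), which is elided here.
es.reindex(
    body={"source": {"index": "metrics-v1"}, "dest": {"index": "metrics-v2"}},
    wait_for_completion=True,
)

# 3. Atomically swap the alias that clients query, so readers
#    move to the new index without ever seeing a gap.
es.indices.update_aliases(
    body={
        "actions": [
            {"remove": {"index": "metrics-v1", "alias": "metrics"}},
            {"add": {"index": "metrics-v2", "alias": "metrics"}},
        ]
    }
)
```

Because the alias update is a single atomic action, queries against the "metrics" alias hit either the old index or the new one, never neither.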

This is a pretty interesting report that I’d never heard of before. It’s long (60 pages), but worth the read for a few choice tidbits. For example, I’ve seen this over and over in my career:

Yet, delayed migrations jeopardize business productivity and effectiveness, as companies experience poor system performance or postpone replacement of hardware past its shelf life.

Also, I was surprised that even now, over 70% of respondents said they still use “Tape Backup / Off-site Storage”. I wonder if people are lumping S3 into that.

Never miss an ack, or you’ll be in even worse trouble.

More on last week’s outage. I have to figure “voltage regular” [sic] means power supply. Everyone hates simultaneous failure.

A full seven years after they started migration, Netflix announced this week that their streaming service is now entirely run out of AWS. That may seem like a long time until you realize that Netflix took a comprehensive approach to the migration:

Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data center along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company.

Outages

  • Telstra
  • Visual Studio Online
    • Caused by a memory-hogging bug in MS SQL Server’s query planner.

  • TNReady
    • Tennessee (US state) saw an outage of the new online version of its school system’s standardized tests.

  • CBS Sports App
The Super Bowl is a terrible time to fail, but of course that’s also when failure is most likely, given the peak in demand.

  • TPG Submarine Fiber Optic Cable
    • This one has some really interesting discussion about how the fiber industry handles failures.

  • Apple Pay