SRE Weekly Issue #40

SPONSOR MESSAGE

Take a bite out of all things DevOps with the video series DevChops. Get easy-to-digest explanations of the most-used DevOps terms and concepts in 90 seconds or less. Watch now: http://try.victorops.com/l/44432/2016-09-16/f7gpzp

Articles

Adrian Colyer summarizes James Hamilton’s 2007 paper in this edition of The Morning Paper. There’s a lot of excellent advice here — some I knew explicitly, some I mostly implement without thinking about it, and some I’d never thought about. The paper is great, but even if you don’t have time to read it, Colyer’s digest version is well worth a browse.

Susan Fowler (featured here a couple weeks ago) has a philosophy of failure in her life that I find really appealing as an SRE:

We can learn something about how to become the best versions of ourselves from how we engineer the best complex systems in the world of software engineering.

And while we’re on the subject of Susan Fowler, she’s got a book coming soon about writing reliable microservices. In the linked ebook version of the second chapter, she goes over the requirements for a production-ready microservice: stability, reliability, scalability, fault-tolerance, catastrophe-preparedness, performance, monitoring, and documentation.

Pinterest explains how they broke their datastore up into 4096(!) shards on 4 pairs of MySQL servers (later 8192 on 8 pairs). It’s an interesting approach, although in essence it treats MySQL as a glorified key-value store for JSON documents.
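To make the shard arithmetic concrete, here is a minimal sketch of how a fixed set of logical shards might be mapped onto a handful of server pairs. Only the 4096-shards-on-4-pairs and 8192-on-8 figures come from the article summary above. The even contiguous range split, the `mysql-pair-N` host labels, and the modulo-based `shard_for_key` are my own assumptions for illustration, not Pinterest's documented scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServerPair:
    name: str         # hypothetical host label, not a real Pinterest hostname
    first_shard: int  # first logical shard on this pair (inclusive)
    last_shard: int   # last logical shard on this pair (inclusive)


def build_ranges(total_shards: int, pair_count: int) -> List[ServerPair]:
    """Split shards 0..total_shards-1 into equal contiguous ranges (assumed layout)."""
    if pair_count <= 0 or total_shards % pair_count != 0:
        raise ValueError("total_shards must divide evenly across pair_count")
    per_pair = total_shards // pair_count
    return [
        ServerPair(f"mysql-pair-{i}", i * per_pair, (i + 1) * per_pair - 1)
        for i in range(pair_count)
    ]


def pair_for_shard(ranges: List[ServerPair], shard_id: int) -> ServerPair:
    """Look up which server pair currently hosts a logical shard."""
    for pair in ranges:
        if pair.first_shard <= shard_id <= pair.last_shard:
            return pair
    raise KeyError(f"shard {shard_id} is not mapped to any server pair")


def shard_for_key(key: int, total_shards: int) -> int:
    """Assumed placement rule: spread integer keys across shards by modulo."""
    return key % total_shards


if __name__ == "__main__":
    ranges = build_ranges(4096, 4)
    shard = shard_for_key(123456789, 4096)
    print(shard, pair_for_shard(ranges, shard).name)
```

The appeal of many small logical shards is that the lookup table, not the data layout, is what changes when hardware is added: a shard can move to a new pair by updating its range entry.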

Do you use Kerberos or similar to authenticate your SSH users? What happens if there’s an incident that’s bad enough to take down your auth infrastructure? I hadn’t realized that OpenSSH supports CAs, but Facebook shows us that PKI support is easy and feature-rich.

Another project from Facebook: a load balancer for DHCP. Facebook found that anycast was not distributing requests evenly across DHCP servers, so they wrote a load balancer in Go.
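For a rough feel of what "distributing requests evenly" can mean, here is a small sketch of one common approach: hash a stable client identifier and take it modulo the number of backends. This is an illustration of the general technique only. I'm not claiming it is the algorithm Facebook's balancer uses, and the MAC addresses and server names below are made up.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def pick_server(client_mac: str, servers: List[str]) -> str:
    """Deterministically map a client MAC address to one backend server."""
    if not servers:
        raise ValueError("at least one backend server is required")
    # Normalize case so the same client always hashes to the same backend.
    digest = hashlib.sha256(client_mac.lower().encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def distribution(macs: Iterable[str], servers: List[str]) -> Counter:
    """Count how many clients land on each backend."""
    return Counter(pick_server(mac, servers) for mac in macs)


if __name__ == "__main__":
    # Hypothetical backends and synthetic client MACs, for illustration only.
    backends = ["dhcp-a", "dhcp-b", "dhcp-c", "dhcp-d"]
    clients = [f"02:00:00:00:{i >> 8:02x}:{i & 0xff:02x}" for i in range(4000)]
    for server, count in sorted(distribution(clients, backends).items()):
        print(server, count)
```

Plain modulo hashing keeps each client pinned to one server while the backend list is stable, but it reshuffles most clients whenever a server is added or removed, which is the usual argument for consistent hashing instead.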

In incident post-analysis, the fundamental attribution error is the tendency to blame other people’s personal flaws when they were involved in an incident, but to blame the system when we were the ones involved. This 4-minute segment from the Pre-Accident Podcast explains fundamental attribution error in more detail.

411 is Etsy’s new tool that runs scheduled queries against Elasticsearch and alerts on the result.

Outages

  • ING Bank
    • Here’s a terribly interesting root cause: during a test, the fire response system emitted an incredibly loud sound while dumping an inert gas into the datacenter, probably loud enough to cause hearing damage. The noise caused multiple key spinning hard drives to fail. Remember shouting at hard drives?
  • Heroku Status
    • Heroku released a follow-up with details on last week’s outage.

      Full disclosure: Heroku is my employer.

  • Gmail for Work
  • Microsoft Azure
    • A major outage in which most DNS queries for Azure resources failed. Microsoft posted a report including a root cause analysis.
Updated: September 18, 2016 — 10:01 pm
A production of Tinker Tinker Tinker, LLC