SRE Weekly Issue #40

Articles

On designing and deploying internet-scale services

Adrian Colyer summarizes James Hamilton’s 2007 paper in this edition of The Morning Paper. There’s a lot of excellent advice here — some I knew explicitly, some I mostly implement without thinking about it, and some I’d never thought about. The paper is great, but even if you don’t have time to read it, Colyer’s digest version is well worth a browse.

Embracing Failure

Susan Fowler (featured here a couple weeks ago) has a philosophy of failure in her life that I find really appealing as an SRE:

We can learn something about how to become the best versions of ourselves from how we engineer the best complex systems in the world of software engineering.

Second Chapter of Production-Ready Microservices by Susan Fowler

And while we’re on the subject of Susan Fowler, she’s got a book coming soon about writing reliable microservices. In the linked ebook version of the second chapter, she goes over the requirements for a production-ready microservice: stability, reliability, scalability, fault-tolerance, catastrophe-preparedness, performance, monitoring, and documentation.

Sharding Pinterest: How we scaled our MySQL fleet

Pinterest explains how they broke their datastore up into 4096(!) shards on 4 pairs of MySQL servers (later 8192 on 8 pairs). It’s an interesting approach, although in essence it treats MySQL as a glorified key-value store for JSON documents.
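To make the shard arithmetic concrete, here is a minimal sketch of mapping a logical shard to the server pair that hosts it. It assumes shards are divided into equal, contiguous ranges per pair; that layout and the name `pair_for_shard` are illustrative assumptions, not necessarily how Pinterest assigns shards.

```python
# Hypothetical layout: logical shards are split into equal, contiguous
# ranges, one range per MySQL server pair. This is an assumption made for
# illustration, not a description of Pinterest's actual mapping.

def pair_for_shard(shard: int, num_shards: int = 4096, num_pairs: int = 4) -> int:
    """Return the index of the server pair hosting the given logical shard."""
    if not 0 <= shard < num_shards:
        raise ValueError("shard out of range")
    shards_per_pair = num_shards // num_pairs  # 4096 // 4 == 1024
    return shard // shards_per_pair


first_pair = pair_for_shard(0)                # shard 0 lands on pair 0
last_pair = pair_for_shard(4095)              # shard 4095 lands on pair 3
grown_pair = pair_for_shard(8191, 8192, 8)    # after growth: pair 7
```

Under this assumed layout, both configurations put 1024 shards on each pair, so going from 4096 shards on 4 pairs to 8192 on 8 pairs adds pairs without changing which pair hosts the original 4096 shards.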

Scalable and secure access with SSH

Do you use Kerberos or similar to authenticate your SSH users? What happens if there’s an incident that’s bad enough to take down your auth infrastructure? I hadn’t realized that OpenSSH supports CAs, but Facebook shows us that PKI support is easy and feature-rich.

DHCPLB: An open source load balancer

Another project from Facebook: a load balancer for DHCP. Facebook found that anycast was not distributing requests evenly across DHCP servers, so they wrote a load balancer in Go.

Safety Moment – Fundamental Attribution Error

In incident post-analysis, a fundamental attribution error is a tendency to see flaws in others as a cause if they were involved in an incident, but to blame the system if we were the ones involved. This 4-minute segment from the Pre-Accident Podcast explains fundamental attribution error in more detail.

Introducing 411: A new open source framework for handling alerting

411 is Etsy’s new tool that runs scheduled queries against Elasticsearch and alerts on the result.
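The general pattern (run a query on a schedule, alert on the result) can be sketched in a few lines. This is not 411's API: `check_once`, `fake_query`, and the count-based threshold are hypothetical names and choices made for illustration, and the query function stands in for a real Elasticsearch search.

```python
from typing import Callable, Dict, List


def check_once(run_query: Callable[[], List[Dict]], threshold: int) -> List[str]:
    """Run one check; return alert messages (empty list if nothing to report)."""
    hits = run_query()
    if len(hits) >= threshold:
        return [f"ALERT: {len(hits)} matching documents (threshold {threshold})"]
    return []


def fake_query() -> List[Dict]:
    """Stand-in for an Elasticsearch search; returns canned documents."""
    return [{"message": "login failed"}, {"message": "login failed"}]


alerts = check_once(fake_query, threshold=2)
```

A scheduler such as cron would call `check_once` at a fixed interval and forward any returned messages to email, chat, or a pager.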

Outages

ING Bank
- Here’s a terribly interesting root cause: during a test, the fire response system emitted an incredibly loud sound while dumping an inert gas into the datacenter — probably loud enough to cause hearing damage. This caused failure in multiple key spinning hard drives. Remember shouting at hard drives?
Heroku Status
- Heroku released a followup with details on last week’s outage.
  Full disclosure: Heroku is my employer.
Gmail for Work
Microsoft Azure
- A major outage in which most DNS queries for Azure resources failed. Microsoft posted a report including a root cause analysis.
