Articles
Sometimes I follow chains of references from article to article until I find a new author to follow, and this time it’s Kelly Sommers. In this gem, she debunks the idea that network partitions are rare by recasting them as availability partitions. If half of your nodes aren’t responding because their CPUs are pegged, the rest of the system sees the same thing it would see in a network partition.
“most partitions I’ve experienced have nothing to do with network infrastructure failures”
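To make the point concrete, here is a minimal sketch of why a caller cannot tell the two failure modes apart. Everything in it is an assumption made for illustration: the `Node` type, the `probe` and `reachable_side` helpers, and the one-second timeout are hypothetical and do not come from the article. A node behind a severed link and a node with a pegged CPU both miss the deadline, so both land on the far side of the "partition".

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    link_up: bool            # False models a severed network link
    response_time_s: float   # how long the node takes to answer when reachable


def probe(node: Node, timeout_s: float) -> bool:
    """Return True only if a reply would arrive within the timeout."""
    if not node.link_up:
        return False                      # the request never arrives
    return node.response_time_s <= timeout_s


def reachable_side(nodes, timeout_s: float = 1.0):
    """Names of the nodes the caller still considers part of the cluster."""
    return sorted(n.name for n in nodes if probe(n, timeout_s))


# Hypothetical four-node cluster.
nodes = [
    Node("a", link_up=True, response_time_s=0.02),
    Node("b", link_up=True, response_time_s=0.03),
    Node("c", link_up=False, response_time_s=0.02),  # network failure
    Node("d", link_up=True, response_time_s=30.0),   # CPU pegged, network fine
]

print(reachable_side(nodes))
```

From the caller's side, nodes `c` and `d` are indistinguishable: each one is just a timeout.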
Two engineers from MMO company DIGIT gave this short, nicely detailed interview in which they outline how they achieve HA on AWS.
Here’s a recording of the DevOps/SRE AMA from a couple of weeks back, in case you missed it.
A blog post by Skyline, which is launching its new deployment-as-a-service offering. The intro is pretty great, outlining the inherent risks in changing code and releasing new code into production.
Other online schema-change tools I’m familiar with (e.g. pt-online-schema-change) use triggers to keep a new table in sync with changes while copying old rows over. gh-ost instead picks up changes by attaching as a replication slave and reading the replication stream. Very clever! This article goes into several reasons why this is a much better approach.
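The log-driven idea can be sketched in a few lines. This is a toy, in-memory model under stated assumptions, not gh-ost's code: the `migrate` and `apply_change` functions, the dict-based "tables", and the list standing in for the replication stream are all hypothetical. The sketch copies existing rows in chunks without overwriting anything already in the new table, while logged changes are replayed in order and do overwrite, so the new table converges on the live one once the log is drained.

```python
def apply_change(change, shadow, transform):
    """Replay one logged change onto the shadow table, overwriting what is there."""
    op, key, row = change
    if op == "delete":
        shadow.pop(key, None)
    else:  # "insert" or "update"
        shadow[key] = transform(row)


def migrate(source, writes, transform, chunk_size=2):
    """Build a shadow table with a new row shape while writes keep arriving."""
    source = dict(source)      # live table, still receiving client writes
    shadow = {}                # new table being built
    log = []                   # stand-in for the replication stream
    applied = 0                # how far the log applier has read
    to_copy = sorted(source)   # keys that existed when the copy started
    writes = list(writes)

    while to_copy or writes:
        # 1. A client write hits the live table and is recorded in the log.
        if writes:
            op, key, row = writes.pop(0)
            if op == "delete":
                source.pop(key, None)
            else:
                source[key] = row
            log.append((op, key, row))

        # 2. Copy the next chunk of old rows; never overwrite existing rows.
        chunk, to_copy = to_copy[:chunk_size], to_copy[chunk_size:]
        for key in chunk:
            if key in source:
                shadow.setdefault(key, transform(source[key]))

        # 3. The applier lags behind: replay one logged change per step.
        if applied < len(log):
            apply_change(log[applied], shadow, transform)
            applied += 1

    # Drain whatever is left in the log before cutting over.
    for change in log[applied:]:
        apply_change(change, shadow, transform)
    return source, shadow


def add_email(row):
    """Hypothetical schema change: add a nullable email column."""
    return {**row, "email": None}


old_rows = {1: {"name": "ann"}, 2: {"name": "bob"}, 3: {"name": "cy"}}
concurrent_writes = [
    ("update", 1, {"name": "anna"}),
    ("delete", 2, None),
    ("insert", 4, {"name": "dee"}),
]
live, new_table = migrate(old_rows, concurrent_writes, add_email)
print(new_table == {k: add_email(v) for k, v in live.items()})
```

A trigger-based tool does the equivalent of step 3 inside the write itself, on the live table; here it runs as a separate reader of the log, which is the distinction the article draws.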
Outages
- Google App Engine
  The outage occurred on August 11, but they posted a postmortem this week.
- Buildkite
  Includes an extremely detailed postmortem, starting with a paging failure and running through 6 lessons learned. #ThereIsNoOneRootCause
- Slack
- Second Life
  Another awesome postmortem by April Linden.
- Travis CI
- eBay
- PlayStation Network
- iiNet (ISP)
- Google Compute Engine