SRE Weekly Issue #37

Articles

The “network partitions are rare” fallacy

Sometimes I follow chains of references from article to article until I find a new author to follow, and this time it led me to Kelly Sommers. In this gem, she debunks the rarity of network partitions by recasting them as availability partitions. If half of your nodes aren’t responding because their CPUs are pegged, you effectively still have a network partition, even though the network itself is fine.

most partitions I’ve experienced have nothing to do with network infrastructure failures
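
To make the point concrete, here is a minimal sketch (not from Kelly's article) of why a caller can't tell the two failure modes apart. The `Node` class, the latency figures, and the 0.5 second timeout are all made up for illustration; the only claim is that a probe which times out looks the same whether the link is cut or the host is too busy to answer.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster member used only for this illustration."""
    name: str
    link_up: bool = True       # False models a severed network path
    cpu_pegged: bool = False   # True models a host too busy to answer in time

    def reply_latency_s(self) -> float:
        """Seconds until a health-probe reply arrives (made-up numbers)."""
        if not self.link_up:
            return math.inf    # the reply never arrives
        return 30.0 if self.cpu_pegged else 0.01


def reachable(nodes, timeout_s=0.5):
    """Names of nodes whose reply arrives before the probe times out."""
    return {n.name for n in nodes if n.reply_latency_s() <= timeout_s}


cluster = [
    Node("a"),
    Node("b"),
    Node("c", link_up=False),     # a "real" network failure
    Node("d", cpu_pegged=True),   # healthy network, overloaded host
]

live = reachable(cluster)
# A strict majority is needed for quorum; two of four is not enough.
has_quorum = len(live) > len(cluster) // 2
print(sorted(live), has_quorum)
```

From the prober's side, nodes `c` and `d` are equally gone, and the cluster loses its majority either way.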

How DIGIT Created High Availability on the Public Cloud to Keep Its Games Running

Two engineers from MMO company DIGIT gave this short, nicely detailed interview in which they outline how they achieve HA on AWS.

DevOps & SRE AMA Video

Here’s a recording of the DevOps/SRE AMA from a couple of weeks back, in case you missed it.

No Way Out But Through

A blog post from Skyline, which is launching a new deployment-as-a-service offering. The intro is pretty great: it outlines the inherent risks in changing code and releasing it into production.

gh-ost: GitHub’s online schema migration tool for MySQL

Other online schema-change tools I’m familiar with (e.g. pt-online-schema-change) use triggers to keep a new table in sync with changes while copying old rows over. Instead, gh-ost monitors changes by hooking on as a replication slave. Very clever! This article goes into several reasons why this is a much better approach.
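
The general idea of copying rows while replaying a change stream can be sketched with plain dictionaries. This is a toy model, not gh-ost's code: the `migrate` function, the event tuples, and the chunk size are invented for illustration, and a real tool reads the MySQL replication stream rather than a Python list.

```python
def apply_event(table, event):
    """Apply one ("upsert" | "delete", key, value) event to a dict-as-table."""
    kind, key, value = event
    if kind == "upsert":
        table[key] = value
    else:
        table.pop(key, None)


def migrate(original, changelog, chunk_size=2):
    """Build a copy of `original` while writes keep arriving.

    `original` is mutated on purpose: each changelog event stands in for a
    live application write that lands between two copy chunks.
    """
    ghost = {}
    pending = sorted(original)   # keys that existed when the copy started
    events = iter(changelog)

    def replay_next():
        event = next(events, None)
        if event is None:
            return False
        apply_event(original, event)  # the application's write
        apply_event(ghost, event)     # the migrator replaying the same change
        return True

    while pending:
        chunk, pending = pending[:chunk_size], pending[chunk_size:]
        for key in chunk:
            if key in original:
                # Skip rows a replayed event already wrote, so the bulk
                # copy never overwrites newer data with an older value.
                ghost.setdefault(key, original[key])
        replay_next()

    while replay_next():         # drain whatever is left after the copy
        pass
    return ghost


table = {1: "a", 2: "b", 3: "c", 4: "d"}
writes = [("upsert", 3, "C"), ("delete", 1, None), ("upsert", 5, "e")]
ghost = migrate(table, writes)
print(ghost == table)
```

No triggers are involved here: the new table converges on the old one purely by replaying the ordered change stream alongside the bulk copy.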

Outages

Google App Engine
- The outage occurred on August 11, but they posted a postmortem this week.
Buildkite
- Includes an extremely detailed postmortem starting with a paging failure and running through 6 lessons learned. #ThereIsNoOneRootCause
Slack
Second Life
- Another awesome postmortem by April Linden.
Travis CI
Facebook
eBay
PlayStation Network
iiNet (ISP)
Google Compute Engine
