SRE Weekly Issue #37

SPONSOR MESSAGE

Frustrated by the lack of tools available to automate incident response? Learn how ChatOps can help manage your operations through group chat in the latest book from O’Reilly. Get your copy here: http://try.victorops.com/l/44432/2016-08-19/f2xt33

Articles

Sometimes I follow chains of references from article to article until I find a new author to follow, and this time it’s Kelly Sommers. In this gem, she debunks the idea that network partitions are rare by recasting them as availability partitions. If half of your nodes aren’t responding because their CPUs are pegged, you still have a network partition.

“most partitions I’ve experienced have nothing to do with network infrastructure failures”

Two engineers from MMO company DIGIT gave this short, nicely detailed interview in which they outline how they achieve HA on AWS.

Here’s a recording of the DevOps/SRE AMA from a couple weeks back, in case you missed it.

A blog post by Skyline, which is launching its new deployment-as-a-service offering. The intro is pretty great, outlining the inherent risks in changing code and releasing new code into production.

Other online schema-change tools I’m familiar with (e.g. pt-online-schema-change) use triggers to keep a new table in sync with changes while copying old rows over. Instead, gh-ost monitors changes by hooking on as a replication slave. Very clever! This article goes into several reasons why this is a much better approach.

Outages

Updated: August 28, 2016 — 10:55 pm
A production of Tinker Tinker Tinker, LLC