SRE Weekly Issue #97


Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after-hours fun. See you in Vegas!


Last month, I linked to an article on Xero’s incident response process, and I said:

I find it interesting that incident response starts off with someone filling out a form.

This article goes into detail on how the form works, why they have it, and the actual questions on the form! Then they go on to explain their “on-call configuration as code” setup, which is really nifty. I can’t wait to see part II and beyond.

Spokes is GitHub’s system for storing distributed replicas of git repositories. This article explains how they can do this over long distances in a reasonable amount of time (and why that’s hard). I especially love the “Spokes checksum” concept.

From the CEO of NS1, a piece on the value of checklists in incident response.

Here’s another great guide on the hows and whys of secondary DNS, including options for dealing with nonstandard record types that aren’t compatible with AXFR.

From a customer’s perspective, “planned downtime” and “outage” often mean the same thing.

“serverless” != “NoOps”

Willis urges the importance of integration with existing operations processes over replacement. “Serverless is just another form of compute. … All the core principles that we’ve really learned about high-performance organizations apply differently … but the principles stay the same,” he said.

When we use root cause analysis, says Michael Nygard, we narrow our focus into counter-factuals that get in the way of finding out what really happened.

CW: hypothetical violent imagery


This week had a weirdly large number of outages!

Updated: November 12, 2017 — 9:09 pm
A production of Tinker Tinker Tinker, LLC