View on sreweekly.com
Lots of outages this week, although not as many as in some previous years on Black Friday. We’ll see what Cyber Monday brings.
I’m writing this from the airport on my way to re:Invent. Perhaps I’ll see some of you there as I rush about from meeting to meeting.
Complete with a nifty flowchart for informed decision-making.
As the title suggests, this article by New Relic is about the mindset of an SRE. I really love number 3, where they discuss the idea that gating production deploys can actually reduce reliability rather than improve it.
It’s what it says on the tin, and it’s targeted at DigitalOcean. One could also use this as a general primer on setting up Heartbeat failover on other cloud platforms.
The Chaos Toolkit is a free, open source project that enables you to create and apply Chaos Experiments to various types of infrastructure, platforms and applications.
It currently supports Kubernetes and Spring.
Here’s a neat little overview of the temporary but massive network that joins the re:Invent venues up and down the Las Vegas strip. Half of the strip is also set up for Direct Connect to the nearest AWS region.
The three pitfalls discussed are confusing EBS latency, idle EC2 instances wasting money, and memory leaks. My favorite gotcha isn’t mentioned: performance cliffs caused by running out of burst credits on T2 instances or gp2 volumes.
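To make the gp2 cliff concrete, here is a rough Python sketch of how long a volume can burst before it drops back to baseline. The constants (a baseline of 3 IOPS per GiB with a floor of 100, a 3,000 IOPS burst ceiling, and a bucket of 5.4 million I/O credits) are the published gp2 figures as I understand them, and the function names are made up for illustration, so check the current AWS documentation before relying on any of it.

```python
# Assumed gp2 figures; verify against the current AWS EBS documentation.
GP2_BASELINE_IOPS_PER_GIB = 3     # baseline IOPS earned per GiB of volume size
GP2_MIN_BASELINE_IOPS = 100       # floor on baseline for small volumes
GP2_BURST_IOPS = 3000             # ceiling while burst credits remain
GP2_CREDIT_BUCKET = 5_400_000     # I/O credits in a full bucket


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size (hypothetical helper)."""
    return max(GP2_MIN_BASELINE_IOPS, GP2_BASELINE_IOPS_PER_GIB * size_gib)


def seconds_until_cliff(size_gib, demand_iops):
    """Seconds a full credit bucket lasts under sustained demand.

    Returns None when demand never exceeds baseline, since the
    bucket is then never drained.
    """
    baseline = gp2_baseline_iops(size_gib)
    rate = min(demand_iops, GP2_BURST_IOPS)  # the volume cannot exceed the burst ceiling
    if rate <= baseline:
        return None
    # Credits drain at (rate - baseline) per second while bursting.
    return GP2_CREDIT_BUCKET / (rate - baseline)


if __name__ == "__main__":
    for size in (10, 100, 500, 1000):
        print(size, "GiB:", gp2_baseline_iops(size), "baseline IOPS,",
              seconds_until_cliff(size, 3000), "seconds of full burst")
```

Under these assumptions, a small volume that looks fine in a short load test can sustain its burst rate for well under an hour, after which throughput falls to a fraction of what you measured. That is the cliff.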
Last month, I linked to an article on Xero’s incident response process, and I said:
“I find it interesting that incident response starts off with someone filling out a form.”
This article goes into detail on how the form works, why they have it, and the actual questions on the form! Then they go on to explain their “on-call configuration as code” setup, which is really nifty. I can’t wait to see part II and beyond.
Spokes is GitHub’s system for storing distributed replicas of Git repositories. This article explains how they can do this over long distances in a reasonable amount of time (and why that’s hard). I especially love the “Spokes checksum” concept.
From the CEO of NS1, a piece on the value of checklists in incident response.
Here’s another great guide on the hows and whys of secondary DNS, including options for dealing with nonstandard record types that aren’t compatible with AXFR.
From a customer’s perspective, “planned downtime” and “outage” often mean the same thing.
“serverless” != “NoOps”
Willis urges the importance of integration with existing operations processes over replacement. “Serverless is just another form of compute. … All the core principles that we’ve really learned about high-performance organizations apply differently … but the principles stay the same,” he said.
When we use root cause analysis, says Michael Nygard, we narrow our focus to counterfactuals that get in the way of finding out what really happened.
CW: hypothetical violent imagery
This week had a weirdly large number of outages!