SRE Weekly Issue #43

Dreamforce this past week was insanely busy but tons of fun.  My colleague Courtney Eckhardt and I gave a shorter version of our talk at SRECon16 about SRE and human factors.


Downtime costs a lot more than you think. Learn why – and how to make the case for Real-time Incident Management.


A theme here in the past few issues has been the insane growth in complexity in our infrastructures. Honeycomb is a new tool-as-a-service to help you make sense of that complexity through event-based introspection. Think ELK or Splunk, but opinionated and way faster. The goal is to give you the ability to reach a state of flow in asking and answering questions about your infrastructure, so you can understand it more deeply, find problems you didn’t know you had, and discover new questions to ask. Here’s where I started getting really interested:

We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.

Mathias Lafeldt rocks it again, this time with a great essay on finding root causes for an incident. I love the idea of using the term “Contributing Conditions” instead. And the Retrospective Prime Directive is so on-point I’ve gotta re-quote it here:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

This paper review by The Morning Paper reminds us of the importance of checking return codes and properly handling errors. Best part: solid statistical evidence.

A followup note on Rachel Kroll’s hilarious and awesome story about 1213486160 (a.k.a. “HTTP”). Basically, if you see a weird number showing up in your logs, it might be a good idea to try interpreting it as a string!

A solid basic primer on Netflix’s chaos engineering tools, with some info about the history and motivation behind them. I love the bit about how they ran into issues when Chaos Monkey terminated itself. Oops.

This article should really be titled, Make Sure Your DNS Is Reliable! It’s easy to forget that all the HA in the world won’t help your infrastructure if the traffic never reaches it due to a DNS failure. And here’s a really good corollary:

Even if your status site is on a separate subdomain, web host, etc… it will still be unavailable if your DNS goes down.

We’ve had a couple of high-profile airline computer system failures this year. Here’s an analysis of the difficulty companies are having bolting new functionality onto systems from the 90s and earlier, even as those systems try to support higher volume due to airline mergers. You may want to skip the bits toward the end that read like an ad, though.

I don’t think I’ve ever been at a company with a dedicated DBA role. It’s becoming a thing of the past, and instead ops folks (and increasingly developers) are becoming the new DBAs. Charity Majors tells us that we need to apply proper operational principals to our datastores. One change at a time, proper deploy and rollback plans, etc.

I love this idea: it’s an exercise in building your own command-line shell. It’s important to have a good grounding in the fundamentals of how processes get spawned and IO works in POSIX systems. Occasionally that’s the only way you can get to the root cause(s) of a really thorny incident.


Updated: October 10, 2016 — 3:21 am
SRE WEEKLY © 2015 Frontier Theme