SRE Weekly Issue #43

Dreamforce this past week was insanely busy but tons of fun. My colleague Courtney Eckhardt and I gave a shorter version of our talk at SRECon16 about SRE and human factors.

Articles

Honeycomb and the Five Why’s

A theme here in the past few issues has been the insane growth in complexity in our infrastructures. Honeycomb is a new tool-as-a-service to help you make sense of that complexity through event-based introspection. Think ELK or Splunk, but opinionated and way faster. The goal is to give you the ability to reach a state of flow in asking and answering questions about your infrastructure, so you can understand it more deeply, find problems you didn’t know you had, and discover new questions to ask. Here’s where I started getting really interested:

We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.

On Finding Root Causes — Production Ready

Mathias Lafeldt rocks it again, this time with a great essay on finding root causes for an incident. I love the idea of using the term “Contributing Conditions” instead. And the Retrospective Prime Directive is so on-point I’ve gotta re-quote it here:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Simple testing can prevent most critical failures

This paper review by The Morning Paper reminds us of the importance of checking return codes and properly handling errors. Best part: solid statistical evidence.
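To make the "check your return codes" point concrete, here is a minimal sketch in Python. It is my own illustration, not code from the paper or the review, and the helper name `run_checked` is hypothetical. The idea is simply that a non-zero exit status becomes a loud error instead of being silently ignored.

```python
import subprocess
import sys


def run_checked(argv):
    """Run a command and return its stdout, refusing to ignore failure.

    `run_checked` is a hypothetical helper name used for illustration only.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    # The easy mistake is to skip this check and carry on with empty
    # or partial output as though the command had succeeded.
    if result.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with status {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # A command that succeeds: its output is returned as usual.
    print(run_checked([sys.executable, "-c", "print('ok')"]), end="")
    # A command that fails: the error is surfaced instead of swallowed.
    try:
        run_checked([sys.executable, "-c", "import sys; sys.exit(2)"])
    except RuntimeError as err:
        print(f"caught: {err}")
```

The standard library can also do this for you: passing `check=True` to `subprocess.run` raises `CalledProcessError` on a non-zero exit.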

1213486160 has a friend: 1195725856

A followup note on Rachel Kroll’s hilarious and awesome story about 1213486160 (a.k.a. “HTTP”). Basically, if you see a weird number showing up in your logs, it might be a good idea to try interpreting it as a string!
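If you want to try this trick yourself, here is a small sketch. The function name `int_as_ascii` is made up for this example, and it assumes the number is a 4-byte big-endian integer, which is my assumption about how the bytes ended up being read as a number.

```python
def int_as_ascii(n, width=4):
    """Reinterpret an unsigned integer as `width` bytes of ASCII text.

    Assumes big-endian byte order; bytes outside the ASCII range are
    shown as the Unicode replacement character.
    """
    return n.to_bytes(width, "big").decode("ascii", errors="replace")


if __name__ == "__main__":
    for number in (1213486160, 1195725856):
        # repr() makes trailing spaces visible in the output.
        print(number, hex(number), repr(int_as_ascii(number)))
```

In hex, 1213486160 is 0x48545450 and 1195725856 is 0x47455420, which is already a hint: every byte falls in the printable ASCII range.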

Netflix details chaos engineering

A solid basic primer on Netflix’s chaos engineering tools, with some info about the history and motivation behind them. I love the bit about how they ran into issues when Chaos Monkey terminated itself. Oops.

How to Handle an Outage Like a Pro

This article should really be titled “Make Sure Your DNS Is Reliable!” It’s easy to forget that all the HA in the world won’t help your infrastructure if the traffic never reaches it due to a DNS failure. And here’s a really good corollary:

Even if your status site is on a separate subdomain, web host, etc… it will still be unavailable if your DNS goes down.

Exploring Airline Outages: A Developer’s Perspective

We’ve had a couple of high-profile airline computer system failures this year. Here’s an analysis of the difficulty companies are having bolting new functionality onto systems from the 90s and earlier, even as those systems try to support higher volume due to airline mergers. You may want to skip the bits toward the end that read like an ad, though.

The Accidental DBA

I don’t think I’ve ever been at a company with a dedicated DBA role. It’s becoming a thing of the past, and instead ops folks (and increasingly developers) are becoming the new DBAs. Charity Majors tells us that we need to apply proper operational principles to our datastores: one change at a time, proper deploy and rollback plans, etc.

GitHub – kamalmarhubi/shell-workshop

I love this idea: it’s an exercise in building your own command-line shell. It’s important to have a good grounding in the fundamentals of how processes get spawned and IO works in POSIX systems. Occasionally that’s the only way you can get to the root cause(s) of a really thorny incident.
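For a taste of what a shell does under the hood, here is a minimal sketch of the classic POSIX fork/exec/wait cycle using Python's `os` module. This is not taken from the workshop itself. The function name `run_command` and the exit-status conventions (127 when the command can't be executed, 128 plus the signal number when it is killed) are my choices, borrowed from common shell behavior. It only runs on POSIX systems.

```python
import os


def run_command(argv):
    """Spawn argv the way a simple shell would and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed (for example, command not found). Exit right
            # away so the child never falls back into the parent's code.
            os._exit(127)
    # Parent: block until the child finishes, then decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # The child was killed by a signal.
    return 128 + os.WTERMSIG(status)


if __name__ == "__main__":
    print("exit status:", run_command(["echo", "hello from the child"]))
    print("exit status:", run_command(["sh", "-c", "exit 3"]))
```

A real shell layers pipes, redirection, and job control on top of this loop, but the fork, exec, wait skeleton is the part that tends to matter when you are staring at a confusing process tree mid-incident.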

Outages

Anik F2 (TV/telecom satellite)
eBay
GitHub
Twilio
National Australia Bank
Destiny (game)
Three (UK telecom)
Level 3 and many major US telecoms
- Level 3 mentioned a “configuration error”.
Outlook.com
PlayStation Network
Verizon
iTunes, App Store, and Apple Music
Netflix
Facebook
