SRE Weekly Issue #51

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

This is a big moment for the SRE field. Etsy has distilled the internal training materials they use to teach employees how to facilitate retrospectives (“debriefings” in Etsy parlance). They’ve released a guide and posted this introduction that really stands firmly on its own. I love the real-world story they share.

And here’s the guide itself. This is essential reading for any SRE interested in understanding incidents in their organization.

Slicer is a general purpose sharding service. I normally think of sharding as something that happens within a (typically data) service, not as a general purpose infrastructure service.  What exactly is Slicer then?

Click through to find out. It’ll be interesting to see what open source projects this paper inspires.

The second in a series, this article delves into the pitfalls of aggregating metrics. Aggregation means you have to choose between bloating your time-series datastore or leaving out crucial stats that you may need during an investigation.

I thought this was going to be primarily an argument for reducing burnout to improve reliability. That’s in there, but the bulk of this article is a bunch of tips and techniques for improving your monitoring and alerting to reduce the likelihood that you’ll be pulled away from your vacation.

The title says it all. Losing the only person with the knowledge of how to keep your infrastructure running is a huge reliability risk. In this article, Heidi Waterhouse (who I coincidentally just met at LISA16!) makes it brilliantly clear why you need good documentation and how to get there.

Here’s another overview of implementing a secondary DNS provider. I like that they cover the difficulties that can arise when you use a provider’s proprietary non-RFC DNS extensions such as weighted round-robin record sets.

Outages

Updated: December 18, 2016 — 9:27 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme