SRE Weekly Issue #54

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

Wow! PagerDuty made waves this week by releasing their internal incident response documentation. This is really exciting, and I’d love it if more companies did this. Their incident response procedures are detailed and obviously the result of hard-won experience. The hierarchical, almost militaristic command-and-control structure is intriguing and makes me wonder what problems they’re solving.

Lots of detail on New Relic’s load testing strategy, along with an interesting tidbit:

“In addition, as we predicted, many sites deployed new deal sites specifically for Cyber Monday with less than average testing. Page load and JavaScript error data represented by far the largest percentage increase in traffic volume, with a 56% bump[…]”

Last in the series, this article argues that metrics aren’t always enough. Sometimes you need to see the details of the actual events (requests, database operations, etc.) that produced the high metric values, and traditional metrics solutions discard these in favor of storing only the numbers.

Let’s Encrypt has gone through a year of intense growth in usage. Their Incidents page has some nicely detailed postmortems, if you’re in the mood.

An eloquent post on striving toward a learning culture in your organization, as opposed to a blaming one, when discussing adverse incidents.

I like to include the occasional debugging deep-dive article, because it’s always good to keep our skills fresh. Here’s one from my coworker on finding the source of an unexpected git error message.

Full disclosure: Heroku, my employer, is mentioned.

Outages

Updated: January 8, 2017 — 9:48 pm
A production of Tinker Tinker Tinker, LLC