SRE Weekly Issue #53


The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here:


Without explicit limits, things fail in unexpected and unpredictable ways. Remember, the limits exist, they’re just hidden.

AWS gives us this in-depth explanation of their use of shuffle sharding in the Route 53 service. This is especially interesting given the Dyn DDoS attack a couple of months ago.
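For readers new to the idea, here is a minimal sketch of shuffle sharding, not AWS's actual implementation. Each customer is assigned a small pseudo-random subset of the available endpoints, so two customers rarely share their entire set. The function name `shuffle_shard`, the node names, and the shard size are all hypothetical, chosen for illustration.

```python
import hashlib
import random


def shuffle_shard(customer_id, nodes, shard_size):
    """Pick a stable pseudo-random subset of nodes for one customer."""
    # Derive a deterministic seed from the customer ID so the same
    # customer always lands on the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # Sample without replacement; sort only to make the output easy to read.
    return sorted(rng.sample(nodes, shard_size))


# Hypothetical fleet of 8 endpoints, 2 per customer.
nodes = ["node-%d" % i for i in range(8)]
shard_a = shuffle_shard("customer-a", nodes, 2)
shard_b = shuffle_shard("customer-b", nodes, 2)
print(shard_a, shard_b)
```

With 8 nodes and shards of 2 there are 28 possible shards, so an attack that takes down one customer's shard fully overlaps with only a small fraction of other customers.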

How does container networking work? Julia Evans points her curious mind toward this question and shares what she learned.

[…] it’s important to understand what’s going on behind the scenes, so that if something goes wrong I can debug it and fix it.

More on the subject of percentiles and incorrect math this week from Circonus. The SLA calculation stuff is especially on point.
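A small illustration of one common form of this incorrect math, assuming the error in question is averaging per-host percentiles: the average of two hosts' 95th percentiles can be far from the 95th percentile of their combined traffic. The latency numbers and the `percentile` helper (nearest-rank method) are made up for this sketch.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, 100 requests per host.
fast_host = [10] * 99 + [20]
slow_host = [10] * 90 + [500] * 10

# Incorrect: average the per-host percentiles.
avg_of_p95 = (percentile(fast_host, 95) + percentile(slow_host, 95)) / 2

# Correct: compute the percentile over all requests together.
true_p95 = percentile(fast_host + slow_host, 95)

print(avg_of_p95, true_p95)  # 255.0 20
```

The averaged figure reports 255 ms while 95% of all requests actually finished in 20 ms or less, which is the kind of gap that can distort an SLA calculation.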

And speaking of SLAs, here’s an excellent article on how to design and adopt an SLA in your product or service.

A summary of a few notable Systems We Love talks. I’m so jealous of all of you folks who got to go!

PagerDuty added #OnCallSelfie support to their app. Amusingly, that first picture is of my (awesome) boss. Hi, Joy!

A post-analysis of an Azure outage from 2012. The especially interesting thing to me is the secondary outage caused by eagerness to quickly deploy a fix to the first outage. There’s a cognitive trap here: we become overconfident when we think we’ve found The Root Cause and we rush to deploy a patch.


Updated: January 1, 2017 — 8:17 pm