SRE Weekly Issue #52

Merry Decemberween, all!  Much like trash pickup service, SRE Weekly comes one day late when it falls on a holiday.


Percentiles are tricky beasts. Does that graph really mean what you think it means?

The math is just broken. An average of a percentile is meaningless.

Thanks to Devops Weekly for this one.
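For a quick feel of why averaging percentiles goes wrong, here is a minimal sketch. The two hosts, their request counts, and the latency values are made up for illustration, and the nearest-rank `percentile` helper is just one common definition, not necessarily the one your monitoring tool uses.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies (ms) for two hosts over the same minute.
busy_host = [10] * 980 + [2000] * 20   # 1000 requests, 2% of them slow
idle_host = [10] * 100                 # 100 requests, none slow

p99_busy = percentile(busy_host, 99)
p99_idle = percentile(idle_host, 99)

# What a dashboard shows if it averages the per-host p99 values.
averaged_p99 = (p99_busy + p99_idle) / 2

# What users actually experienced: the p99 over all requests combined.
true_p99 = percentile(busy_host + idle_host, 99)

print(f"per-host p99: {p99_busy} ms and {p99_idle} ms")
print(f"average of p99s: {averaged_p99} ms")
print(f"p99 of all requests: {true_p99} ms")
```

With these made-up numbers the averaged figure lands around half the real p99, because the average gives the idle host the same weight as the busy one even though it served a tenth of the traffic.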

There’s that magical “human error” again.

ChangeIP suffered a major outage two weeks ago and they posted this analysis of the incident. Thanks, folks! Does this sound familiar?

We learned that when we started providing this service to the world, we made design and data layout decisions that made sense at the time but no longer do.

Shuffle sharding is a nifty technique for preventing impact from spreading to multiple users of your service. A great example is the way Route 53 assigns nameservers for hosted DNS zones:

172800 IN NS
172800 IN NS
172800 IN NS
172800 IN NS
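The general idea of shuffle sharding can be sketched in a few lines. This is an illustration only, not Route 53's actual implementation: the node names, customer IDs, shard size, and the `shuffle_shard` and `fully_impacted` helpers are all hypothetical.

```python
import hashlib
import random

def shuffle_shard(customer_id, nodes, shard_size):
    """Pick a stable pseudo-random subset of nodes for one customer."""
    # Seed from a hash of the customer ID so the same customer always
    # gets the same shard, with no lookup table to maintain.
    digest = hashlib.sha256(customer_id.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(nodes, shard_size))

def fully_impacted(shards, failed_nodes):
    """Customers whose every assigned node is in the failed set."""
    failed = set(failed_nodes)
    return [c for c, assigned in shards.items() if set(assigned) <= failed]

# Hypothetical fleet of 8 nodes; each customer gets 2 of them.
nodes = [f"node-{i}" for i in range(8)]
customers = [f"customer-{i}" for i in range(20)]
shards = {c: shuffle_shard(c, nodes, 2) for c in customers}

# Suppose customer-0 sends a poison request that takes down both of its nodes.
victims = fully_impacted(shards, shards["customer-0"])
print(f"customer-0 is on {shards['customer-0']}")
print(f"customers with no healthy node left: {victims}")
```

Only customers that happen to share the exact same pair lose all of their nodes; everyone else still has at least one healthy node to retry against. With more nodes and a larger shard size, the number of possible combinations grows quickly and a full overlap becomes rare.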

Fastly has a brilliant, simple, and clever solution to load balancing and connection draining using a switch ignorant of layer 4.

Incurring connection resets on upgrades has ramifications far beyond disrupting production traffic: it provides a disincentive for continuous software deployment.

Heroku shared a post-analysis of their major outage on December 15.

Full disclosure: Heroku is my employer.


  • NTP server pool
    • Load on the worldwide NTP server pool increased significantly due to a “buggy Snapchat app update”. What was Snapchat doing with NTP? (more details)
  • Zappos
    • Zappos had a cross-promotion with T-Mobile, and the traffic overloaded them. Thanks to Amanda Gilmore for this one.
  • Slack
    • Among other forms of impairment, /-commands were repeated numerous times. At $JOB, this meant that people accidentally paged their coworkers over and over until we disabled PagerDuty.
  • Librato
    • “What went well” is an important part of any post-analysis.
  • Tumblr
  • Southwest Airlines
