SRE Weekly Issue #91

I’m heading to New York tomorrow and will be at Velocity Tuesday and Wednesday. If you’re there, look for the weirdo in the SRE Weekly shirt and hit me up for some nifty swag! Also, maybe check out my talk on DNS, if you’re into that kind of thing.

Thanks to an eagle-eyed reader for pointing out that I totally screwed up the HTML on the link last week. Oops.


Like DevOps? Register for All Day DevOps – a FREE online conference this October, offering 100 DevOps-focused sessions across six different tracks. Learn more & register:


Here’s how Hosted Graphite made their job ad for an SRE-like role (Ops Automation Engineer) more inclusive. The article is filled with specific before/after language snippets, each with a detailed explanation of why they made the change.

A couple of weeks after their major outage last October, Dyn published this article explaining secondary DNS. It’s a great primer and digs into what to do if you use advanced non-standard functionality like ALIAS records and traffic balancing.

SignalFx goes into deep detail on their feature for predicting future metric values. We get an explanation of why prediction is difficult and a discussion of the math involved in their solution.

Payments: we really have to get them right. Here’s Dropbox’s Jessica Fisher with a discussion of how they reconcile failed payments.

No matter what goes wrong, our top priority is to make sure that customers receive service for which they’ve been charged, and aren’t charged for service they haven’t received.

A couple of weeks ago, I linked to a story about Resilience4j, a fault tolerance library for Java. This week is the second installment that shows you how to use it to implement circuit breakers. There’s also an interesting discussion of one of the implementation details.
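
If you haven’t run into circuit breakers before, here’s a rough sketch of the general pattern in plain Python. To be clear, this is not Resilience4j’s API and not the linked article’s code. The class name, the state names, and the parameters (failure_threshold, reset_timeout, clock) are all made up for illustration. The idea: after enough consecutive failures the breaker opens and rejects calls outright, then lets a single trial call through once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Toy circuit breaker (hypothetical names, not a real library's API).

    CLOSED -> OPEN after `failure_threshold` consecutive failures,
    OPEN -> HALF_OPEN once `reset_timeout` seconds have passed,
    HALF_OPEN -> CLOSED on success, or back to OPEN on failure.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the cool-down can be tested
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hitting the struggling dependency.
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: allow one trial call through.
            self.state = "HALF_OPEN"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        # A failed trial call reopens the breaker immediately.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

You’d wrap a flaky dependency with something like breaker.call(fetch_profile, user_id), where fetch_profile is whatever function you want to protect. Real libraries add more on top of this, such as sliding failure-rate windows and metrics, so treat it as a picture of the state machine rather than something to ship.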

Here’s a cute little debugging story. Turns out ntpd has a bit of a blind spot!

Adcash CTO Arnaud Granal gives us a rare glimpse into the multiple iterations of their infrastructure. Hear what worked well and what didn’t as they scaled to handle 500k requests per second at peak.


  • OpenSRS (DNS provider)
    • OpenSRS (registrar and DNS provider, among other services) had a major outage in their DNS service.

      At 1AM UTC we were the target of a sophisticated DNS attack that was followed by an unrelated double failure of core network equipment at our main Canadian data center, caused by an undocumented software limitation.


  • Amadeus (airline booking system)
    • Amadeus provides the technical underpinnings of many airlines around the world. They had issues this past week, taking a lot of airlines with them.
  • SourceForge
    • Our [data center] hosting provider has been having issues with a power distribution unit.

  • Facebook
Updated: October 1, 2017 — 8:52 pm
A production of Tinker Tinker Tinker, LLC