SRE Weekly Issue #45

This past Friday, it was as if Oprah were laughing maniacally, shouting, “You get an outage, and you get an outage, EVERYONE GETS AN OUTAGE!” Hat-tip to all of you who spent the day fighting fires like I did.

I’ve decided to dedicate this week’s issue entirely to the incident, the fallout, and information on redundant DNS solutions. I’ll save all of the other great articles from this week for the next issue.

SPONSOR MESSAGE

Tell the world what you think about being on-call. Participate in the annual State of On-Call Survey.

Articles

The Register has a good overview of the attacks.

Dyn released this statement on Saturday with more information on the outage. Anecdotally (based on my direct experience), I’m not sure their timeline is quite right, as it seems that I and others I know saw impact later than Dyn’s stated resolution time of 1pm US Eastern time. Dyn’s status post indicates resolution after 6pm Eastern, which matches more closely with what I saw.

Among many other sites and services, AWS experienced an outage. They posted an unprecedented amount of detail during the incident in a banner on their status site. Their status page history doesn’t include the banner text, so I’ll quote it here:

These events were caused by errors resolving the DNS hostnames for some AWS endpoints. AWS uses multiple DNS service providers, including Amazon Route53 and third-party service providers. The root cause was an availability event that occurred with one of our third-party DNS service providers. We have now applied mitigations to all regions that prevent impact from third party DNS availability events.

Nice job with the detailed information, AWS!

Krebs on Security notes that the attack came hours after a talk on DDoS attacks at NANOG. Krebs was a previous target of a massive DDoS, apparently in retaliation for his publishing of research on DDoS attacks.

Paging through PagerDuty was down or badly broken throughout Friday. Many pages didn’t come through, and those that did sometimes couldn’t be acknowledged. Some pages got stuck and PagerDuty’s system would repeatedly call engineers and leave the line silent. [source: personal experience] Linked is PagerDuty’s “Initial Outage Report”. It includes a preliminary explanation of what went wrong, an apology, and a pledge to publish two more posts: a detailed timeline and a root cause analysis with remediation items.

Sumo Logic implemented dual nameserver providers after a previous DNS incident. This allowed them to survive the Dyn outage relatively unscathed.

Anonymous and New World Hackers have claimed responsibility for Friday’s attack, but that claim may be suspect. Their supposed reasons were to “retaliate for the revocation of Julian Assange’s Internet access” and to “test their strength”.

WikiLeaks bought their claim and asked supporters to cut it out.

A great basic overview of how recursive DNS queries work, published by Microsoft.

A much more detailed explanation if how DNS works. Recursive and iterative DNS queries are covered in more depth, along with AXFR/IXFR/NOTIFY which can be used to set up a redundant secondary DNS provider for your domain.

Heroku was among many sites impacted by the attack. Heroku’s status site was also impacted, prompting them to create a temporary mirror. This is painfully reminiscent of an article linked two issues ago about making sure your DNS is redundant.

Full disclosure: Heroku is my employer.

EasyDNS has this guide on setting up redundant DNS providers to safeguard against an attack like this. However, I’m not sure about their concept of “fast fluxing” your nameservers, that is, changing your nameserver delegation with your registrar when your main DNS provider is under attack.

Unless I’m mistaken, a change in nameservers for a domain can take up to 2 days (for .com) to be seen by all end-users, because their ISP’s recursive resolver could have your domain’s NS records cached, and the TTL given by the .com nameservers is 2 days.

Am I missing something? Maybe recursive resolvers optimistically re-resolve your domain’s NS records if all of your nameservers time out or return SERVFAIL? If you know more about this, please write to me!

Outages

I’m not going to bother naming all of the many sites and services that went down on Friday, because there were too many and I’d inevitably miss some. The below outages all occurred earlier in the week.

Updated: October 23, 2016 — 10:20 pm
SRE WEEKLY © 2015 Frontier Theme