SRE Weekly Issue #46

This may be the biggest issue to date. Lots of great articles this week, plus updates from the Dyn DDoS, and of course all of the awesome content I held off on posting last week.

Articles

Being an Effective Ally to Women and Non-Binary People

I’ve linked to several posts on Etsy’s Code as Craft blog in the past, and here’s another great one. Perhaps not the typical SRE article you might have been expecting me to link to, but this stuff is important in every tech field, including SRE. We can’t succeed unless every one of us has a fair chance at success.

Hacker puts ‘full redundancy’ code-hosting firm out of business

In 2014, CodeSpaces suffered a major security breach that abruptly ended their business. I’d say that’s a pretty serious reliability risk right there, showing that security and reliability are inextricably intertwined.

OUTAGE! AMA

Check it out! Catchpoint is doing another Ask Me Anything, this time about incident response. Should be interesting!

Complex System Failures and Blameless Retrospectives by Courtney Eckhardt

My fellow Heroku SRE, Courtney Eckhardt, expanded on a section of our joint SRECon talk for this session at OSFeels. She had time for Q&A, and there were some really great questions!

The Power of Less Code

Mathias rocks it, as usual, in this latest issue of Production Ready.

The Netflix Tech Blog: Netflix Chaos Monkey Upgraded

Netflix has released a new version of Chaos Monkey with some interesting new features.

The Myth of the Root Cause: How Complex Web Systems Fail

Scalyr worked with Mathias Lafeldt to turn his already-awesome pair of articles into this excellent essay. He brings in real-world examples of major outages and draws conclusions based on them. He also hits on a number of other topics he’s written about previously. Great work, folks!

octocatalog-diff: GitHub’s Puppet development and testing tool – GitHub Engineering

How many times have you pushed out a Puppet change that you tested very thoroughly, only to find that it did something unexpected on a host you didn’t even think it would apply to? GitHub’s solution to that is this tool that shows catalog changes for all host types in your fleet in a diff-style format.

Pokemon Go: How the cloud saved the smash hit game from collapse

“We ended up 50 percent over our worst case after day one, we figured this was going to be bad within six hours”

It’s pretty impressive to me that Niantic managed to keep Pokemon Go afloat as well as they did. They worked very closely with Google to grow their infrastructure much faster than they had planned to.

Spotify Engineering: Making Ops Human

As Spotify has grown to 1400 microservices and 10,000 servers, they’ve moved toward a total ownership model, in which development teams are responsible for their code in production.

How the Friday DDoS attack affected Pingdom

Pingdom suffered a major outage during the Dyn DDoS, not only due to their own DNS-related issues, but also due to the massive number of alerts their system was trying to send out to notify customers that their services were failing.

[…] at 19:20 we went to DEFCON 1 and suspended the alerting service altogether.

Dyn Analysis Summary Of Friday October 21 Attack

Here’s Dyn’s write-up of the DDoS.

Service Disruption Root Cause Analysis and Follow-up Actions from October 21st, 2016

As they promised, here’s PagerDuty’s root cause analysis from the Dyn DDoS.

Routing around single point of failure DNS issues

This is a pretty great idea. Ably has written their client libraries to reach their service through a secondary domain if the primary one is having DNS issues. Interestingly, their domain, ably.io, was also impacted by the .io TLD DNS outage (detailed below) just days after they wrote this.
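
The fallback idea can be sketched roughly as follows. This is only an illustration of the general technique, not Ably’s actual client code: the function name `first_resolvable` and the hostnames in the usage note are hypothetical, and a real client would also need to handle connection failures, not just DNS resolution failures.

```python
import socket
from typing import Callable, Iterable, Optional


def first_resolvable(
    hosts: Iterable[str],
    port: int = 443,
    resolve: Callable = socket.getaddrinfo,
) -> Optional[str]:
    """Return the first hostname that resolves, or None if none do.

    `hosts` is an ordered list of candidates: the primary domain first,
    then one or more fallbacks that ideally live under a different TLD
    and use a different DNS provider (an assumption of this sketch).
    """
    for host in hosts:
        try:
            # getaddrinfo raises socket.gaierror when the lookup fails.
            resolve(host, port)
            return host
        except socket.gaierror:
            # DNS failed for this candidate; try the next one.
            continue
    return None
```

A caller might use `first_resolvable(["api.example.io", "api.example-fallback.com"])` (both hostnames are placeholders) and connect to whichever name comes back.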

Looking Back On The Largest DDoS In History

StatusPage.io gives us some really interesting numbers around the Dyn DDoS, based on status posts made by their customers. I wonder, was it really the largest in history?

Serverless Operations is Not a Solved Problem

Here’s a nice write-up of day one of ServerlessConf, with the theme, “NoOps isn’t a thing”.

AWS Server Migration Service

Not a magic bullet, but still pretty interesting.

[…] it allows you to incrementally replicate live Virtual Machines (VMs) to the cloud without the need for a prolonged maintenance period. You can automate, schedule, and track incremental replication of your live server volumes, simplifying the process of coordinating and implementing large-scale migrations that span tens or hundreds of volumes.

Reset router could have saved Census

Earlier this year, Australia’s online census site suffered a major outage. Here’s a little more detail on what went wrong. TL;DR: a router dropped its configuration on reboot.

How GOV.UK Reduced their Incidents and Alerts

Gov.uk has put in place a lot of best practices in incident response and on-call health.

After extensive rationalisation, GOV.UK have reached a stage where only 6 types of incidents can alert (wake them up) out of hours. The rest can wait until next morning.

Unfortunately, I’m guessing one of those six types happened this week, as you can see in the Outages section below.

Outages

.io TLD
- The entire .io top-level domain went down, resulting in impact to a lot of trendy companies that use *.io domains. It doesn’t matter how many DNS providers you have for your domain if your TLD’s nameservers aren’t able to give out their IPs. Worse yet, .io’s servers sometimes did respond, but with an incorrect NXDOMAIN for valid domains. .io’s negative-caching TTL of 3600 seconds made this pretty nasty.
  
  On the plus side, this outage provided the last piece of the puzzle in answering my question, “does ‘fast-fluxing’ your DNS providers really work?”. Answer: no. I’ll write up all of my research soon and post a link to it here.
The Pirate Bay
California DMV
AT&T
British Telecom
gov.uk
PlayStation Network
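
To see why the negative-caching TTL mentioned in the .io entry above hurts so much, here is a toy model of a resolver’s negative cache. It is a simplified sketch, not how any particular resolver is implemented; the class name `NegativeCache` and its methods are made up for illustration, and only the 3600-second figure comes from the outage description.

```python
from typing import Dict


class NegativeCache:
    """Toy resolver cache that remembers NXDOMAIN answers for a fixed TTL."""

    def __init__(self, negative_ttl: int) -> None:
        self.negative_ttl = negative_ttl
        # Maps a domain name to the time (in seconds) its NXDOMAIN expires.
        self._expires: Dict[str, float] = {}

    def record_nxdomain(self, name: str, now: float) -> None:
        """Remember that `name` was reported as nonexistent at time `now`."""
        self._expires[name] = now + self.negative_ttl

    def is_blocked(self, name: str, now: float) -> bool:
        """True while the cached NXDOMAIN is still being served for `name`."""
        expiry = self._expires.get(name)
        return expiry is not None and now < expiry
```

In this model, a single bad NXDOMAIN recorded at time 0 with a TTL of 3600 keeps the domain unreachable for clients of that resolver for up to an hour, even if the authoritative servers recover a minute later.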
