General

SRE Weekly Issue #46

This may be the biggest issue to date. Lots of great articles this week, plus follow-ups on the Dyn DDoS, and of course all of the awesome content I held off on posting last week.

SPONSOR MESSAGE

Tell the world what you think about being on-call. Participate in the annual State of On-Call Survey.

Articles

I’ve linked to several posts on Etsy’s Code as Craft blog in the past, and here’s another great one. Perhaps not the typical SRE article you might have been expecting me to link to, but this stuff is important in every tech field, including SRE. We can’t succeed unless every one of us has a fair chance at success.

In 2014, CodeSpaces suffered a major security breach that abruptly ended their business. I’d say that’s a pretty serious reliability risk right there, showing that security and reliability are inextricably intertwined.

Check it out! Catchpoint is doing another Ask Me Anything, this time about incident response. Should be interesting!

My fellow Heroku SRE, Courtney Eckhardt, expanded on a section of our joint SRECon talk for this session at OSFeels. She had time for Q&A, and there were some really great questions!

Mathias rocks it, as usual, in this latest issue of Production Ready.

Netflix has released a new version of Chaos Monkey with some interesting new features.

Scalyr worked with Mathias Lafeldt to turn his already-awesome pair of articles into this excellent essay. He brings in real-world examples of major outages and draws conclusions based on them. He also hits on a number of other topics he’s written about previously. Great work, folks!

How many times have you pushed out a puppet change that you tested very thoroughly, only to find that it did something unexpected on a host you didn’t even think it would apply to? Etsy’s solution to that is this tool that shows catalog changes for all host types in your fleet in a diff-style format.
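
For the flavor of the idea (this is not Etsy's tool, just a rough sketch), you could compile catalogs for a representative node against your current and proposed Puppet code, dump both to JSON, and diff the flattened resource lists. It assumes catalog JSON with a top-level "resources" array of type/title/parameters entries.

```python
# Rough sketch of the catalog-diff idea (not Etsy's tool): diff two compiled
# catalogs (dumped as JSON) to see which resources and parameters actually change.
import difflib
import json
import sys

def resource_lines(catalog_path):
    """Flatten a compiled catalog's resources into sorted 'Type[title] key=value' lines."""
    with open(catalog_path) as f:
        catalog = json.load(f)
    lines = []
    for res in catalog.get("resources", []):
        ref = f'{res["type"]}[{res["title"]}]'
        for key, value in sorted(res.get("parameters", {}).items()):
            lines.append(f"{ref} {key}={value!r}")
    return sorted(lines)

if __name__ == "__main__":
    old, new = sys.argv[1], sys.argv[2]  # e.g. catalogs compiled from prod vs. your branch
    diff = difflib.unified_diff(resource_lines(old), resource_lines(new),
                                fromfile=old, tofile=new, lineterm="")
    print("\n".join(diff))
```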

“We ended up 50 percent over our worst case after day one, we figured this was going to be bad within six hours”

It’s pretty impressive to me that Niantic managed to keep Pokemon Go afloat as well as they did. They worked very closely with Google to grow their infrastructure much faster than they had planned to.

As Spotify has grown to 1400 microservices and 10,000 servers, they’ve moved toward a total ownership model, in which development teams are responsible for their code in production.

Pingdom suffered a major outage during the Dyn DDoS, not only due to their own DNS-related issues, but also due to the massive number of alerts their system was trying to send out to notify customers that their services were failing.

[…] at 19:20 we went to DEFCON 1 and suspended the alerting service altogether.

Here’s Dyn’s write-up of the DDoS.

As they promised, here’s PagerDuty’s root cause analysis from the Dyn DDoS.

This is a pretty great idea. Ably has written their client libraries to reach their service through a secondary domain if the primary one is having DNS issues. Interestingly, their domain, ably.io, was also impacted by the .io TLD DNS outage (detailed below) just days after they wrote this.
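
Here's a minimal sketch of that fallback-host pattern (not Ably's actual client code). The hostnames are made up; the point is just that a DNS failure on the primary domain shouldn't strand the client.

```python
# A minimal sketch of the fallback-host pattern (not Ably's actual client code).
# The hostnames are hypothetical; if the primary domain fails (including DNS
# errors), the client retries against a host on a different domain.
import urllib.request

HOSTS = ["api.example.io", "api.example-fallback.com"]  # hypothetical primary + fallback

def get(path, timeout=5):
    last_error = None
    for host in HOSTS:
        try:
            with urllib.request.urlopen(f"https://{host}{path}", timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:   # URLError, gaierror, and timeouts all subclass OSError
            last_error = exc     # fall through and try the next host
    raise last_error
```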

StatusPage.io gives us some really interesting numbers around the Dyn DDoS, based on status posts made by their customers.  I wonder, was it really the largest in history?

Here’s a nice write-up of day one of ServerlessConf, with the theme, “NoOps isn’t a thing”.

Not a magic bullet, but still pretty interesting.

[…] it allows you to incrementally replicate live Virtual Machines (VMs) to the cloud without the need for a prolonged maintenance period. You can automate, schedule, and track incremental replication of your live server volumes, simplifying the process of coordinating and implementing large-scale migrations that span tens or hundreds of volumes.

Earlier this year, Australia’s online census site suffered a major outage. Here’s a little more detail into what went wrong. TL;DR: a router dropped its configuration on reboot.

Gov.uk has put in place a lot of best practices in incident response and on-call health.

After extensive rationalisation, GOV.UK have reached a stage where only 6 types of incidents can alert (wake them up) out of hours. The rest can wait until next morning.

Unfortunately, I’m guessing one of those six types happened this week, as you can see in the Outages section below.

Outages

  • .io TLD
    • The entire .io top-level domain went down, resulting in impact to a lot of trendy companies that use *.io domains. It doesn’t matter how many DNS providers you have for your domain if your TLD’s nameservers aren’t able to give out their IPs. Worse yet, .io’s servers sometimes did respond, but with an incorrect NXDOMAIN for valid domains. .io’s negative-caching TTL of 3600 seconds made this pretty nasty (see the sketch just after this outage list).

      On the plus side, this outage provided the last piece of the puzzle in answering my question, “does ‘fast-fluxing’ your DNS providers really work?”. Answer: no. I’ll write up all of my research soon and post a link to it here.

  • The Pirate Bay
  • California DMV
  • AT&T
  • British Telecom
  • gov.uk
  • PlayStation Network
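
About that .io negative-caching TTL: per RFC 2308, resolvers cache an NXDOMAIN for the lesser of the zone's SOA record TTL and its MINIMUM field. Here's a quick sketch to check it, assuming the third-party dnspython package.

```python
# Sketch: check how long resolvers will cache an NXDOMAIN under a zone.
# Per RFC 2308 the negative-caching TTL is min(SOA record TTL, SOA MINIMUM field).
# Requires the third-party dnspython package (pip install dnspython).
import dns.resolver

def negative_cache_ttl(zone):
    answer = dns.resolver.resolve(zone, "SOA")
    soa = answer[0]
    # Note: a recursive resolver may return a partially-decremented cached TTL.
    return min(answer.rrset.ttl, soa.minimum)

print(negative_cache_ttl("io."))  # ~3600 at the time of the outage, per the write-up
```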

SRE Weekly Issue #45

This past Friday, it was as if Oprah were laughing maniacally, shouting, “You get an outage, and you get an outage, EVERYONE GETS AN OUTAGE!” Hat-tip to all of you who spent the day fighting fires like I did.

I’ve decided to dedicate this week’s issue entirely to the incident, the fallout, and information on redundant DNS solutions. I’ll save all of the other great articles from this week for the next issue.

SPONSOR MESSAGE

Tell the world what you think about being on-call. Participate in the annual State of On-Call Survey.

Articles

The Register has a good overview of the attacks.

Dyn released this statement on Saturday with more information on the outage. Anecdotally (based on my direct experience), I’m not sure their timeline is quite right, as it seems that I and others I know saw impact later than Dyn’s stated resolution time of 1pm US Eastern time. Dyn’s status post indicates resolution after 6pm Eastern, which matches more closely with what I saw.

Among many other sites and services, AWS experienced an outage. They posted an unprecedented amount of detail during the incident in a banner on their status site. Their status page history doesn’t include the banner text, so I’ll quote it here:

These events were caused by errors resolving the DNS hostnames for some AWS endpoints. AWS uses multiple DNS service providers, including Amazon Route53 and third-party service providers. The root cause was an availability event that occurred with one of our third-party DNS service providers. We have now applied mitigations to all regions that prevent impact from third party DNS availability events.

Nice job with the detailed information, AWS!

Krebs on Security notes that the attack came hours after a talk on DDoS attacks at NANOG. Krebs was a previous target of a massive DDoS, apparently in retaliation for his publishing of research on DDoS attacks.

Paging through PagerDuty was down or badly broken throughout Friday. Many pages didn’t come through, and those that did sometimes couldn’t be acknowledged. Some pages got stuck and PagerDuty’s system would repeatedly call engineers and leave the line silent. [source: personal experience] Linked is PagerDuty’s “Initial Outage Report”. It includes a preliminary explanation of what went wrong, an apology, and a pledge to publish two more posts: a detailed timeline and a root cause analysis with remediation items.

Sumo Logic implemented dual nameserver providers after a previous DNS incident. This allowed them to survive the Dyn outage relatively unscathed.

Anonymous and New World Hackers have claimed responsibility for Friday’s attack, but that claim may be suspect. Their supposed reasons were to “retaliate for the revocation of Julian Assange’s Internet access” and to “test their strength”.

WikiLeaks bought their claim and asked supporters to cut it out.

A great basic overview of how recursive DNS queries work, published by Microsoft.

A much more detailed explanation of how DNS works. Recursive and iterative DNS queries are covered in more depth, along with AXFR/IXFR/NOTIFY, which can be used to set up a redundant secondary DNS provider for your domain.
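
As a rough illustration of the zone-transfer mechanism a secondary provider relies on, here's a sketch using the third-party dnspython package. The primary's address and the zone name are placeholders, and in practice you configure this at your DNS providers rather than in application code.

```python
# Sketch of the zone-transfer (AXFR) mechanism a secondary DNS provider relies on.
# The primary's address and zone name are placeholders; the primary must allow
# transfers from this host. Requires dnspython.
import dns.query
import dns.zone

PRIMARY = "192.0.2.1"        # placeholder primary nameserver address
ZONE = "example.com"

zone = dns.zone.from_xfr(dns.query.xfr(PRIMARY, ZONE))
for name, node in sorted(zone.nodes.items()):
    for rdataset in node.rdatasets:
        print(name, rdataset)
```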

Heroku was among many sites impacted by the attack. Heroku’s status site was also impacted, prompting them to create a temporary mirror. This is painfully reminiscent of an article linked two issues ago about making sure your DNS is redundant.

Full disclosure: Heroku is my employer.

EasyDNS has this guide on setting up redundant DNS providers to safeguard against an attack like this. However, I’m not sure about their concept of “fast fluxing” your nameservers, that is, changing your nameserver delegation with your registrar when your main DNS provider is under attack.

Unless I’m mistaken, a change in nameservers for a domain can take up to 2 days (for .com) to be seen by all end-users, because their ISP’s recursive resolver could have your domain’s NS records cached, and the TTL given by the .com nameservers is 2 days.

Am I missing something? Maybe recursive resolvers optimistically re-resolve your domain’s NS records if all of your nameservers time out or return SERVFAIL? If you know more about this, please write to me!
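
In the meantime, here's a sketch (dnspython again, example.com as a placeholder) of how to ask a .com gTLD server directly for a domain's delegation and see the TTL in question:

```python
# Sketch: query a .com gTLD server directly for a domain's NS delegation and
# print the TTL resolvers may cache it for (the ~2-day figure mentioned above).
# Uses example.com as a placeholder domain; requires dnspython.
import dns.message
import dns.query
import dns.resolver

gtld_ip = dns.resolver.resolve("a.gtld-servers.net", "A")[0].address
query = dns.message.make_query("example.com", "NS")
response = dns.query.udp(query, gtld_ip, timeout=5)

# When you query the parent, the delegation usually comes back in the authority section.
for rrset in response.authority or response.answer:
    print(rrset.name, rrset.ttl, rrset)
```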

Outages

I’m not going to bother naming all of the many sites and services that went down on Friday, because there were too many and I’d inevitably miss some. The below outages all occurred earlier in the week.

SRE Weekly Issue #44

SPONSOR MESSAGE

DevOps Executive Webinar: Security for Startups in a DevOps World. http://try.victorops.com/l/44432/2016-10-12/fgh7n3

Articles

With all the “NoOps” and “Serverless” stuff floating around, do we need ops? Susan Fowler says not necessarily, but that we do need ops skills.

VictorOps is gathering data for the 2016 edition of their yearly State of On-Call Report (2015’s if you missed it). Please click the link above and take the survey if you have a moment! The report provides some pretty awesome stats that we can all use to improve the on-call experience at our organizations.

This survey is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Scalyr writes about cascading failure scenarios, using the DynamoDB outage of September 20th, 2015 (no, not this year’s September DynamoDB outage) as a case study.

Capacity problems are a common type of failure, and often they’re of this “cascading” variety. A system that’s thrashing around in a failure state often uses more resources than it did when it was healthy, creating a self-reinforcing overload.
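
As a back-of-the-envelope illustration (numbers entirely made up), naive client retries are one way a partial failure turns into that self-reinforcing overload:

```python
# Back-of-the-envelope sketch (made-up numbers): when a backend starts failing,
# naive client retries multiply the offered load, which is one way a capacity
# problem becomes the self-reinforcing "cascading" overload described above.
def offered_load(base_rps, failure_rate, retries):
    """Total request rate once each failed attempt is retried up to `retries` times."""
    load = base_rps
    attempts = base_rps
    for _ in range(retries):
        attempts *= failure_rate      # only failed attempts are retried
        load += attempts
    return load

print(offered_load(1000, 0.0, 3))   # healthy: 1000 rps
print(offered_load(1000, 0.5, 3))   # 50% failures + 3 retries: 1875 rps
```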

Check it out! Apparently this newsletter started around the same time that SRE Weekly did. Content includes a lot of really nifty stuff about Linux system administration.

I previously linked to a two-part series by Mathias Lafeldt on writing postmortems. At my request, Jimdo graciously agreed to release their (previously) internal postmortem about the incident that prompted him to write the articles. Thanks so much, Mathias!

A review of what sounds like a really interesting play about just culture, blameless retrospectives, and restorative justice in aviation, based on real events.

Thanks to Mathias Lafeldt for this one.

When you’re big like Facebook, sometimes reliability means essentially building your own Internet.

If you haven’t had time to watch Matt Ranney’s talk on Scaling Uber to 1000 Microservices, check out this detailed summary. Growing your engineering force 10x over a year while still keeping the service reliable is a pretty impressive feat.

PagerDuty shares some tips for lowering your MTTR, but first they ask the important question: how are you measuring MTTR, and is lowering it meaningful?
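
For example, here's one possible MTTR calculation on hypothetical incident data; whether "resolved" means acknowledged, mitigated, or fully fixed is exactly the definitional question PagerDuty is getting at.

```python
# One possible MTTR calculation (hypothetical data). Whether "R" means
# acknowledge, mitigate, or fully resolve changes the number dramatically.
from datetime import datetime
from statistics import mean

incidents = [  # (triggered, resolved) -- made-up timestamps
    (datetime(2016, 10, 3, 9, 14), datetime(2016, 10, 3, 9, 52)),
    (datetime(2016, 10, 12, 22, 5), datetime(2016, 10, 13, 0, 40)),
    (datetime(2016, 10, 21, 13, 30), datetime(2016, 10, 21, 13, 58)),
]

mttr_minutes = mean((resolved - triggered).total_seconds() / 60
                    for triggered, resolved in incidents)
print(f"MTTR: {mttr_minutes:.0f} minutes")
```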

David Christensen riffs on Charity Majors’s concept of “3 Types of Code”: “no code” (SaaS, PaaS, etc), “someone else’s code”, and “your code”. Try to spend as much development time as possible writing code that supports what makes your business unique (your key differentiator).

Julia Evans is back with a write-up of the lessons she’s learned as she’s begun to gain an understanding of operations. My favorite bit:

Stage 2.5: learn to be scared
I think learning to be scared is a really important skill – you should be worried about upgrading a database safely, or about upgrading the version of Ruby you’re using in production. These are dangerous changes!

SysAdvent is happening again this year! Click the link above if you’d like to propose an article or volunteer to be an editor.

Outages

  • United Airlines
  • Yahoo mail
  • Google Cloud
  • FNB (South African bank)
  • GlobalSign (SSL certificate authority)
    • GlobalSign had a major problem in their PKI that resulted in all of their certificates being treated as revoked. They’ve posted a detailed postmortem that’s pretty heavy on deep SSL details, but the basic story is that their OCSP service misinterpreted a routine action as a request to revoke their intermediate CA certificate. Yikes. I love this quote and the mental image of a panicked party with streamers and ribbon-cutting that it conjures up:

      Our AlphaSSL and CloudSSL customers had to wait a few hours more while an emergency key ceremony was held to create alternatives.

SRE Weekly Issue #43

Dreamforce this past week was insanely busy but tons of fun.  My colleague Courtney Eckhardt and I gave a shorter version of our talk at SRECon16 about SRE and human factors.

SPONSOR MESSAGE

Downtime costs a lot more than you think. Learn why – and how to make the case for Real-time Incident Management. http://try.victorops.com/l/44432/2016-07-13/dpn2qw

Articles

A theme here in the past few issues has been the insane growth in complexity in our infrastructures. Honeycomb is a new tool-as-a-service to help you make sense of that complexity through event-based introspection. Think ELK or Splunk, but opinionated and way faster. The goal is to give you the ability to reach a state of flow in asking and answering questions about your infrastructure, so you can understand it more deeply, find problems you didn’t know you had, and discover new questions to ask. Here’s where I started getting really interested:

We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.

Mathias Lafeldt rocks it again, this time with a great essay on finding root causes for an incident. I love the idea of using the term “Contributing Conditions” instead. And the Retrospective Prime Directive is so on-point I’ve gotta re-quote it here:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

This paper review by The Morning Paper reminds us of the importance of checking return codes and properly handling errors. Best part: solid statistical evidence.
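
As a toy illustration (hypothetical function names, not taken from the paper), the difference between swallowing an error and surfacing it looks something like this:

```python
# Toy illustration (hypothetical names) of the pattern the paper calls out:
# one ignored error in a rarely-exercised handler is all it takes.

def flush_to_disk_bad(buffer, path):
    try:
        with open(path, "w") as f:
            f.write(buffer)
    except OSError:
        pass  # TODO: handle this later -- the classic swallowed error

def flush_to_disk_better(buffer, path):
    try:
        with open(path, "w") as f:
            f.write(buffer)
            f.flush()
    except OSError as exc:
        # Surface the failure so callers (and operators) can react to it.
        raise RuntimeError(f"failed to persist buffer to {path}") from exc
```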

A followup note on Rachel Kroll’s hilarious and awesome story about 1213486160 (a.k.a. “HTTP”). Basically, if you see a weird number showing up in your logs, it might be a good idea to try interpreting it as a string!
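
The decoding trick itself fits in a couple of lines of Python:

```python
# A mysterious 32-bit value in your logs is often just ASCII in disguise.
print((1213486160).to_bytes(4, "big"))   # b'HTTP'
print(int.from_bytes(b"HTTP", "big"))    # 1213486160
```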

A solid basic primer on Netflix’s chaos engineering tools, with some info about the history and motivation behind them. I love the bit about how they ran into issues when Chaos Monkey terminated itself. Oops.

This article should really be titled, Make Sure Your DNS Is Reliable! It’s easy to forget that all the HA in the world won’t help your infrastructure if the traffic never reaches it due to a DNS failure. And here’s a really good corollary:

Even if your status site is on a separate subdomain, web host, etc… it will still be unavailable if your DNS goes down.

We’ve had a couple of high-profile airline computer system failures this year. Here’s an analysis of the difficulty companies are having bolting new functionality onto systems from the 90s and earlier, even as those systems try to support higher volume due to airline mergers. You may want to skip the bits toward the end that read like an ad, though.

I don’t think I’ve ever been at a company with a dedicated DBA role. It’s becoming a thing of the past, and instead ops folks (and increasingly developers) are becoming the new DBAs. Charity Majors tells us that we need to apply proper operational principles to our datastores: one change at a time, proper deploy and rollback plans, etc.

I love this idea: it’s an exercise in building your own command-line shell. It’s important to have a good grounding in the fundamentals of how processes get spawned and IO works in POSIX systems. Occasionally that’s the only way you can get to the root cause(s) of a really thorny incident.
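
If you want a taste before committing to the full exercise, here's a bare-bones sketch of the core fork/exec/wait loop, skipping pipes, redirection, and job control entirely:

```python
# A bare-bones version of the exercise: the fork/exec/wait loop at the heart
# of any POSIX shell (Unix-only, no pipes, redirection, or job control).
import os
import shlex

while True:
    try:
        argv = shlex.split(input("msh> "))
    except EOFError:
        break
    if not argv:
        continue
    if argv[0] == "exit":
        break
    pid = os.fork()
    if pid == 0:                      # child: replace ourselves with the command
        try:
            os.execvp(argv[0], argv)
        except FileNotFoundError:
            print(f"msh: command not found: {argv[0]}")
            os._exit(127)
    else:                             # parent: wait for the child to finish
        os.waitpid(pid, 0)
```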

Outages

SRE Weekly Issue #42

SPONSOR MESSAGE

[WEBINAR] The Do’s and Dont’s of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Netflix’s API has an advanced circuit-breaker system including a defined automated fallback plan for every dependency.
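
Not Netflix's implementation, but the general shape of a circuit breaker with a predefined fallback looks roughly like this sketch (names and thresholds are made up):

```python
# Not Netflix's actual code -- just the shape of the idea: wrap every dependency
# call with a trip threshold and a predefined fallback response.
import time

class CircuitBreaker:
    def __init__(self, call, fallback, max_failures=5, reset_after=30.0):
        self.call, self.fallback = call, fallback
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def __call__(self, *args, **kwargs):
        if self.opened_at and time.time() - self.opened_at < self.reset_after:
            return self.fallback(*args, **kwargs)      # circuit open: fail fast
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()           # trip the breaker
            return self.fallback(*args, **kwargs)
        self.failures, self.opened_at = 0, None        # success: close the circuit
        return result

# Hypothetical usage: recommendations = CircuitBreaker(fetch_recs, lambda user: POPULAR_TITLES)
```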

This is Sydney Dekker’s course on Just Culture, including a full explanation of Restorative Just Culture. I especially like the concept of Second Victims of incidents: the practitioner (e.g. engineer) that was directly involved in the incident.

Your practitioners are not necessarily the cause of the incident. They themselves are the recipients of trouble deeper inside your organization.

Think you know how TCP works? There are sneaky edge-cases that can cause an outage if you don’t know about them. Example: a MySQL replicating slave will happily report “0 seconds behind master” indefinitely while waiting on a connection to the master that’s long-since silently failed.
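
One general mitigation for silently dead connections is TCP keepalives, sketched below; the TCP_KEEP* options are Linux-specific, and for the MySQL case an application-level replication heartbeat is the more direct fix.

```python
# Sketch: enable TCP keepalives so the kernel probes an idle peer and eventually
# errors out the socket instead of waiting forever. TCP_KEEP* options are
# Linux-specific; values here are illustrative.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before probing
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before giving up
# sock.connect(("db.example.internal", 3306))                  # placeholder host/port
```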

Etsy shares the operational issues they encountered as they moved toward an API/microservice architecture. I especially like the detail about limiting concurrent in-flight sub-requests per root request across the entire request tree.
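
Here's a sketch of that concurrency cap (hypothetical names, a single-process approximation of Etsy's request-tree-wide limit): one semaphore is shared by every sub-request spawned on behalf of a root request.

```python
# Sketch (hypothetical names): cap concurrent in-flight sub-requests per root
# request by sharing a single semaphore across the whole fan-out.
import asyncio

async def fetch_endpoint(name):
    await asyncio.sleep(0.1)          # stand-in for calling a downstream service
    return f"{name}: ok"

async def handle_root_request(endpoints, max_in_flight=4):
    limiter = asyncio.Semaphore(max_in_flight)

    async def sub_request(name):
        async with limiter:           # at most max_in_flight sub-requests at once
            return await fetch_endpoint(name)

    return await asyncio.gather(*(sub_request(e) for e in endpoints))

print(asyncio.run(handle_root_request([f"svc-{i}" for i in range(10)])))
```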

My co-worker at Heroku, Stella Cotton, gave this rockin’ keynote at RailsConf 2016. She covers load testing and performance bottleneck diagnosis, and most of what she says applies not just to Rails.

Here’s a summary of a talk about Uber’s system that stores live location data of riders and drivers. They run Cassandra in containers managed by Mesos.

With an MVP, you’re just trying to get into the market and test the waters as quickly as possible, so there’s a temptation to leave considerations like scalability for later. But what if your MVP is unexpectedly successful?

Systems We Love is a new conference modeled after the popular Papers We Love. It looks really interesting, and they’re saying they already have a lot of great proposals.

Travis CI shares more about a major outage last month.

A nice incident response primer from Scalyr.

Outages
