
SRE Weekly Issue #11

Articles

The big scary news this week was the buffer overflow vulnerability in glibc’s getaddrinfo function. Aside from the obvious impact of intrusions on availability, this bug requires us to roll virtually every production service in our fleets. If we don’t have good procedures in place for that, now is when we find out the hard way, through downtime.
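For reference, the bug is CVE-2015-7547, and shipping fixed packages isn’t enough on its own: every long-running process linked against the old libc also needs a restart. Here’s a hedged sketch of a per-host check you might run during a fleet sweep; the 2.23 threshold is an assumption, since distros backport the fix without bumping the version.

    # A quick inventory check, not a substitute for your distro's advisory:
    # report this host's glibc version so a fleet-wide sweep can flag
    # likely-vulnerable hosts. The 2.23 threshold is illustrative, and every
    # service linked against the old libc still needs a restart after patching.
    import platform

    libc, version = platform.libc_ver()
    if libc == "glibc" and tuple(int(p) for p in version.split(".")[:2]) < (2, 23):
        print(f"glibc {version}: possibly vulnerable to CVE-2015-7547, plan a roll")
    else:
        print(f"{libc or 'unknown libc'} {version}: confirm against your distro's advisory")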

Bryan Cantrill, CTO of Joyent, really crystallized the gross feelings that have been rolling around in my mind with regard to unikernels. I would point colleagues to this article if they suggested we should deploy unikernel applications. He makes lots of good points, especially this one:

Unikernels are entirely undebuggable. There are no processes, so of course there is no ps, no htop, no strace — but there is also no netstat, no tcpdump, no ping!

I find the implicit denial of debugging production systems to be galling, and symptomatic of a deeper malaise among unikernel proponents: total lack of operational empathy.

Atlassian dissects their response to a recent outage and in the process shares a lot of excellent detail on their incident response and SRE process. I love that they’re using the Incident Commander system (though under a different name). This could have (and probably has) come out of my mouth:

The primary goal of the incident team is to restore service. This is not the same as fixing the problem completely – remember that this is a race against the clock and we want to focus first and foremost on restoring customer experience and mitigating the problem. A quick and dirty workaround is often good enough for now – the emphasis is on “now”!

My heart goes out to the passengers hurt and killed and to their families, but also to the controller who made the error. There’s a lot to investigate here about how a single human was put in a position where a single error could cause such devastation. Hopefully there are ways in which the system can be remediated to prevent such catastrophes in the future.

Like medicine, we can learn a lot about how to prevent and deal with errors from the hard lessons learned in aviation.

You’d think technically advanced aircraft would be safer with all that information and fancy displays. Why they’re not has a lot to do with how our brains work.

When I saw Telstra offer a day of free data to its customers to make up for last week’s outage, I cringed. I’m impressed that they survived last Sunday as Australia used 1.8 petabytes of data.

In this article, the author describes discovering that a service he previously ignored and assumed saw very little traffic actually served a million requests per day.

If ignorance isn’t an operational strategy, what is? Paranoia. You should code and run your systems like a large group of Internet lunatics are out to abuse the hell out of them.

This is a great intro to Chaos Engineering, a field I didn’t know existed, born out of Netflix’s Chaos Monkey. This is the first article in what the author promises will be a biweekly series.

Thanks to Devops Weekly for this one.

Outages

SRE Weekly Issue #10


This week’s issue is packed with really meaty articles, which is a nice departure from last week’s somewhat sparse issue.

Articles

So much about what modern medicine has learned about system failures applies directly to SRE, usually without any adaptation required. In this edition of The Morning Paper, Adrian Colyer gives us his take on an excellent short paper by an MD. My favorite quotes:

Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The Software Evolution & Architecture Lab of the University of Zurich is doing a study on modern continuous deployment practices. I’m really interested to see the results, especially where CD meets reliability, so if you have a moment, please hop on over and add your answers. Thanks to Gerald Schermann at UZH for reaching out to me for this.

I’ve been debating with myself whether or not to link to Emerson Network Power’s survey of datacenter outage costs and causes. The report itself is mostly just uninteresting numbers and it’s behind a signup-wall. However, this article is a good summary of the report and links in other interesting stats.

Facebook algorithmically generated hundreds of millions of custom-tailored video montages for its birthday celebration. How they did it without dedicating specific hardware to the task and without impacting production is a pretty interesting read.

Administering ElasticSearch can be just as complicated and demanding as MySQL. This article has an interesting description of SignalFX’s method for resharding without downtime.
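The article has the specifics; as a generic sketch (not necessarily SignalFX’s exact approach), the usual zero-downtime pattern is to create a new index with the target shard count, copy data over with _reindex, and atomically swap an alias so clients never notice. The URLs, index names, and shard count below are illustrative.

    # Sketch of the common alias-swap resharding pattern against the
    # Elasticsearch REST API. In practice you'd also dual-write or run a
    # follow-up delta reindex to catch writes that land during the copy.
    import requests

    ES = "http://localhost:9200"

    # 1. Create a new index with the desired shard count.
    requests.put(f"{ES}/events_v2", json={"settings": {"number_of_shards": 12}})

    # 2. Copy documents from the old index while clients keep using the alias.
    requests.post(f"{ES}/_reindex", json={
        "source": {"index": "events_v1"},
        "dest": {"index": "events_v2"},
    })

    # 3. Atomically repoint the alias so readers and writers move in one step.
    requests.post(f"{ES}/_aliases", json={
        "actions": [
            {"remove": {"index": "events_v1", "alias": "events"}},
            {"add": {"index": "events_v2", "alias": "events"}},
        ],
    })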

This is a pretty interesting report that I’d never heard of before. It’s long (60 pages), but worth the read for a few choice tidbits. For example, I’ve seen this over and over in my career:

Yet, delayed migrations jeopardize business productivity and effectiveness, as companies experience poor system performance or postpone replacement of hardware past its shelf life.

Also, I was surprised that even now, over 70% of respondents said they still use “Tape Backup / Off-site Storage”. I wonder if people are lumping S3 into that.

Never miss an ack or you’ll be in even worse trouble.

More on last week’s outage. I have to figure “voltage regulator” means power supply. Everyone hates simultaneous failure.

A full seven years after they started migration, Netflix announced this week that their streaming service is now entirely run out of AWS. That may seem like a long time until you realize that Netflix took a comprehensive approach to the migration:

Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data center along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company.

Outages

  • Telstra
  • Visual Studio Online
    • Caused by a memory-hogging bug in MS SQL Server’s query planner.

  • TNReady
    • Tennessee (US state) saw an outage of the new online version of its school system’s standardized tests.

  • CBS Sports App
    • The Super Bowl is a terrible time to fail, but of course that’s also when failure is most likely, given the peak in demand.

  • TPG Submarine Fiber Optic Cable
    • This one has some really interesting discussion about how the fiber industry handles failures.

  • Apple Pay

SRE Weekly Issue #9

Articles

I spoke too soon in the last issue! Github has posted an extremely thorough postmortem that answers any questions one might have had about last week’s outage. I like the standard they’re holding themselves to for timely communication:

One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Just monitoring servers isn’t enough to detect an outage. Sometimes even detailed service monitoring can miss an overall performance degradation that involves multiple services in an infrastructure. In this blog post, PagerDuty suggests also monitoring key business metrics (logins, purchase rate, etc).
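A minimal sketch of the idea, with a hypothetical metrics fetch and paging call: compare a business metric against its own recent baseline and page on a large drop, even when host-level checks are all green.

    # Sketch: watch a business metric (logins per minute) and page when it
    # drops well below its recent baseline. page() and the metric source are
    # hypothetical stand-ins for your paging integration and metrics store.
    import statistics

    def page(message):
        # Stand-in for a real paging integration.
        print("PAGE:", message)

    def check_login_rate(recent_rates, current_rate, min_samples=30):
        """recent_rates: logins/min samples from the last N minutes."""
        if len(recent_rates) < min_samples:
            return  # not enough history to establish a baseline
        baseline = statistics.median(recent_rates)
        if baseline > 0 and current_rate < 0.5 * baseline:
            page(f"Logins at {current_rate}/min, under half the {baseline}/min baseline")

    # Example: a healthy baseline around 200 logins/min, currently seeing 40.
    check_login_rate(recent_rates=[200] * 60, current_rate=40)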

In this case, “yesterday” is in 2013, but this is an excellent postmortem from Mailgun that can serve as an example for all of us.

A customer’s perspective on a datacenter outage, with emphasis on the need for early, frequent, and thorough communication from service providers.

A nicely detailed outage postmortem, including the gory details of the train of thought the engineers followed on the way to a solution. They hint at an important technique that’s not discussed nearly enough, in my opinion: judicious application of bandaid solutions to resolve the outage and allow engineers to continue their interrupted personal time. It’s not necessary to fix a problem the “right” way in the moment, and carefully-applied bandaids help reduce on-call burnout.

How can we be sure (or at least sort of confident) that distributed systems won’t fail? They can be incredibly complex, and their failures can be even more complex. Caitie McCaffrey gives us this ACM Queue article about methods for formal and informal verification.

Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.

Medium has announced a commitment to publishing postmortems for all outages. I’d love to see more companies making a commitment like this. Thanks to reader Pete Shima for this link.

Outages

SRE Weekly Issue #8


If you only read two articles this week, make them the first two. They’re excellent and exactly the kind of content I’m looking for. If you come across (or write!) anything that would go well in SRE Weekly, I’d love it if you’d toss a link my way.

Articles

Liz Fong-Jones, a Googler and co-chair of SRECon, describes a scale of activities SRE teams engage in, from the basics (keeping the service operating) to having the freedom to improve the service.

This is a really awesome paper. Two Googlers describe in detail the pitfalls of failover-based systems and explain how they design multi-homed active/active services. If Google has learned a lesson, we’d all do well to learn from it, too:

Our experience has been that bolting failover onto previously singly-homed systems has not worked well. These systems end up being complex to build, have high maintenance overhead to run, and expose complexity to users. Instead, we started building systems with multi-homing designed in from the start, and found that to be a much better solution. Multi-homed systems run with better availability and lower cost, and result in a much simpler system overall.
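Obviously a toy compared to what the paper describes, but here’s a sketch of the client-side flavor of active/active, assuming two made-up regional endpoints and a /healthz check: every region takes traffic all the time, and an unhealthy one is simply skipped rather than “failed over” to.

    # Toy sketch of an active/active client: all regions serve traffic, and an
    # unhealthy region is skipped with no failover event. Endpoints and the
    # health-check path are illustrative.
    import random
    import requests

    REGIONS = [
        "https://us-east.example.com",
        "https://eu-west.example.com",
    ]

    def healthy(base_url):
        try:
            return requests.get(f"{base_url}/healthz", timeout=1).ok
        except requests.RequestException:
            return False

    def call(path):
        candidates = [r for r in REGIONS if healthy(r)]
        random.shuffle(candidates)
        for base in candidates:
            try:
                return requests.get(f"{base}{path}", timeout=2)
            except requests.RequestException:
                continue  # try the next live region
        raise RuntimeError("no healthy region available")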

A review of CloudHarmony’s numbers on various cloud providers’ availability in 2015 versus 2014, along with a discussion of how customers deal with outages. I’m a little puzzled by this one:

That’s also partly why most public cloud workloads aren’t used for production or mission-critical applications.

I’m pretty sure plenty of mission-critical stuff is running in EC2, for example.

The team at parall.ax chose Lambda because there are no long-lived servers, and they could offload all the work of scaling their app up and down with demand to Amazon.

Randall Munroe takes on an important question: is it possible to siphon water from Europa to Earth? Okay, the only relation to SRE is that a team of Google SREs submitted the question, but I really love What If.

VictorOps distilled their Minimum Viable Runbooks series (featured here previously) into a polished PDF, with their usual high quality and style.

During an outage this week, Vodafone admitted that they forgot to update their status site. They are looking into an automated system to make updates during outages.

I’ve mostly worked jobs without compensation for on-call, but one job had it. Compensation is nice, but in that case it was there to offset a truly heinous level of pages, so it was small comfort. If you have any good articles about the merits and pitfalls of on-call compensation, please send them my way.

Outages

Lots of downtime this week, including some recurrences and some big names.

SRE Weekly Issue #7

A big thanks to Charity Majors (@mipsytipsy) for tweeting about SRE Weekly and subsequently octupling my subscriber list!

Articles

This article is gold. CaitieM explains why clients can’t be trusted, even when they’re written in-house. She describes how her team avoided an outage during the Halo 4 launch by turning off non-essential functionality. Had she trusted the clients, she might not have built in the kill switches that let her shed the excessive load caused by a buggy client.
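Here’s a minimal sketch of the kill-switch idea (the feature name and flag store are invented): non-essential handlers consult a runtime flag, so operators can shed their load without a deploy.

    # Sketch of a kill switch for a non-essential feature. The flag store here
    # is just a dict; in practice it would live in a config service or
    # feature-flag system that operators can flip at runtime.
    FLAGS = {"presence_updates": True}

    def do_presence_update(request):
        # Stand-in for the real (expensive) non-essential work.
        return {"status": 200}

    def handle_presence_update(request):
        if not FLAGS.get("presence_updates", False):
            # Shed the load: fail fast and cheap while the switch is off.
            return {"status": 503, "body": "presence temporarily disabled"}
        return do_presence_update(request)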

Facebook recently released a live video streaming feature. Because they’re Facebook, they’re dealing with a scale that existing solutions can’t even come close to supporting (think millions of viewers for celebrity live video broadcasts). This article goes into detail about how they handle that level of concurrency for live streaming. I especially like the bit about request coalescing.
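Request coalescing isn’t unique to Facebook; as a rough sketch of the general technique (the fetch function is a stand-in), concurrent requests for the same hot key share a single backend fetch instead of stampeding the origin.

    # Sketch of request coalescing: the first caller for a hot key does the
    # expensive fetch, and concurrent callers for the same key wait for that
    # in-flight result instead of hitting the origin themselves. Nothing is
    # cached once the fetch completes; this only dedupes concurrent work.
    import threading

    _lock = threading.Lock()
    _inflight = {}   # key -> Event signalled when the leader's fetch finishes
    _results = {}    # key -> most recently fetched value

    def coalesced_get(key, fetch):
        with _lock:
            event = _inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                _inflight[key] = event
        if leader:
            try:
                _results[key] = fetch(key)
            finally:
                with _lock:
                    _inflight.pop(key, None)
                event.set()
            return _results[key]
        event.wait()
        return _results.get(key)  # None if the leader's fetch raised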

Best. I pretty much only like the parodies of Uptown Funk.

This is a really great little essay comparing running a large infrastructure with flying a plane by instruments. Paying attention to just one or two instruments without understanding the big picture results in errors.

Thanks to Devops Weekly for this one.

An awesome incident response summary for an outage caused by domain name expiration. The live Grafana charts are awesome, along with the dashboard snapshot. It’s exciting to see how far that project has come!

Calculating availability is hard. Really hard. First, you have to define just what constitutes availability in your system. Once you’ve decided how you calculate availability, you’ve defined the goalposts for improving it. In this article, VividCortex presents a general, theoretical formula for availability and a corresponding 3D graph that shows that improving availability involves both increasing MTBF and reducing MTTR.
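VividCortex’s formula is more general than this, but the classic relationship underneath it, with made-up numbers plugged in, shows why the surface slopes along both axes:

    A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}},
    \qquad
    \mathrm{MTBF} = 720\ \mathrm{h},\ \mathrm{MTTR} = 1\ \mathrm{h}
    \;\Rightarrow\;
    A = \frac{720}{721} \approx 99.86\%

Doubling MTBF to 1440 hours gets you to roughly 99.93%, and so does halving MTTR to half an hour: you can buy availability on either axis.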

TechCentral.ie gives us this opinion piece on the frequency of outages in major cloud providers. The author argues that, though reported outages may seem major, they still rarely cause violation of SLAs, and service availability is still probably better than individual companies could manage on their own.

Full disclosure: Heroku, my employer, is mentioned.

An external post-hoc analysis of the recent outage at JetBlue, with speculation on the seeming lack of effective DR plans at JetBlue and Verizon. The article also mentions the massive outage at 365 Main’s San Francisco datacenter in 2007, which is definitely worth a read if you missed that one.

Linden Lab Systems Engineer April wrote up a detailed postmortem of the multiple failures that went into a rough weekend for Second Life users. I worked on recovery from at least a few failures in that central database in my several years at Linden, and it’s pretty tricky managing the thundering herd that floods through the gates when you reopen them. Good luck folks, and thanks for the excellent write-up!
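Not Linden Lab’s actual approach, but one common way to blunt that thundering herd is to ramp admissions gradually with per-client jitter instead of opening the gates all at once; the ramp length below is invented.

    # Sketch of a jittered, ramped reopen: rather than letting every waiting
    # client reconnect at once, admit an increasing fraction of login attempts
    # over a ramp window. Rejected clients retry later with their own random
    # backoff, which spreads the load instead of synchronizing it.
    import random
    import time

    RAMP_SECONDS = 1800  # reopen over 30 minutes (made-up figure)
    _reopened_at = time.time()

    def admit_login():
        """Return True if this login attempt should be admitted right now."""
        elapsed = time.time() - _reopened_at
        allowed_fraction = min(1.0, elapsed / RAMP_SECONDS)
        return random.random() < allowed_fraction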

Netflix has taken the Chaos Monkey to the next level. Now their automated system investigates the services a given request touches and injects artificial failures in various dependencies to see if they cause end-user errors. It takes a lot of guts to decide that purposefully introducing user-facing failures is the best way to ultimately improve reliability.

…we’re actually impacting 500 members requests in a day, some of which are further mitigated by retries. When you’re serving billions of requests each day, the impact of these experiments is very small.
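The mechanics, greatly simplified and certainly not Netflix’s actual implementation, look something like this: tag a tiny sample of live requests at the edge, then fail one chosen dependency call only for tagged requests. The names and sampling rate below are invented.

    # Simplified sketch of request-scoped fault injection: a small sample of
    # live requests is tagged, and calls to one chosen dependency fail only
    # for those tagged requests.
    import random

    INJECTION_RATE = 0.0001          # fraction of requests in the experiment
    FAILED_DEPENDENCY = "ratings"    # dependency under test (made up)

    def tag_request(request_context):
        """Decide at the edge whether this request participates."""
        request_context["inject_failure_in"] = (
            FAILED_DEPENDENCY if random.random() < INJECTION_RATE else None
        )

    def call_dependency(name, request_context, real_call):
        """Wrap every dependency call; fail it only for tagged requests."""
        if request_context.get("inject_failure_in") == name:
            raise RuntimeError(f"injected failure in dependency '{name}'")
        return real_call()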

Outages

Only a few this week, but they were whoppers!

  • Twitter
    • Twitter suffered a massive outage at least 2 hours long with sporadic availability for several hours after. Hilariously, they posted status about the outage on Tumblr.

  • Comcast (SF Bay area)
  • Africa
    • This is the first time I’ve had an entire continent in this section. Most of Africa’s Internet was cut off from the rest of the world due to a pair of fiber cuts. South Africa was hit especially hard.
