
SRE Weekly Issue #26

Articles

Here’s Charity Majors being awesome as always. There’s a reason this article is first this week. In part one of a two-part article, Charity recaps her recent talk at serverlessconf, in which she argues that you can never get away from operations, no matter how “serverless” you go.

[…] no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth […]

I’m still laughing about #NoDevs. Thought-leadering through trolling FTW.

This is an older article (2011), but it’s still well worth reading. Facebook began automating remediation of standard hardware failures, and then they reinvested the time saved into improving the automation.

Today, the FBAR service is run by two full time engineers, but according to the most recent metrics, it’s doing the work of 200 full time system administrators.
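To make the idea concrete, here’s a minimal Python sketch of the detect/remediate/escalate loop this kind of automation boils down to. It is not anything FBAR actually uses; the helper functions are hypothetical placeholders for your monitoring, repair, and paging tooling.

```python
import logging
import time

log = logging.getLogger("auto-remediation")

# Hypothetical helpers: real implementations would talk to your hardware
# monitoring system, repair workflows, and ticketing/paging tools.
def detect_failed_hosts():
    """Return hostnames with a recognized, standard failure class."""
    return []

def run_remediation(host):
    """Apply the standard fix (reboot, re-image, file a repair ticket)."""
    return True  # True if the standard fix succeeded

def escalate_to_human(host):
    """Hand the odd, non-standard cases to a person."""
    log.warning("escalating %s to a human", host)

def remediation_loop(poll_interval=60):
    """Continuously apply the standard fix; escalate anything unusual."""
    while True:
        for host in detect_failed_hosts():
            if not run_remediation(host):
                escalate_to_human(host)
        time.sleep(poll_interval)
```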

A system that doesn’t auto-scale to meet demand can be unreliable in the face of demand spikes. But auto-scaling adds complexity to a system, and increasing complexity can also decrease reliability. This article outlines a method for reasoning about auto-scaling decisions based on multiple metrics. Bonus TIL: Erlang threads busy-wait for work.
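As a rough illustration of what “scaling on multiple metrics” can look like, here’s a hedged Python sketch. The specific metrics and thresholds are my own assumptions, not taken from the article; the point is the asymmetry between scaling up eagerly and scaling down cautiously.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    cpu_utilization: float   # 0.0 - 1.0, averaged across instances
    queue_depth: int         # pending jobs waiting for a worker
    p95_latency_ms: float    # request latency, 95th percentile

def desired_instances(current: int, m: Metrics,
                      min_instances: int = 2, max_instances: int = 50) -> int:
    """Scale up if *any* signal says we're behind; scale down only when
    *all* signals agree we have slack. Being under-provisioned usually
    costs more than being over-provisioned, hence the asymmetry."""
    scale_up = (m.cpu_utilization > 0.75
                or m.queue_depth > 100
                or m.p95_latency_ms > 500)
    scale_down = (m.cpu_utilization < 0.30
                  and m.queue_depth == 0
                  and m.p95_latency_ms < 200)
    if scale_up:
        target = current + max(1, current // 4)   # grow ~25% at a time
    elif scale_down:
        target = current - 1                      # shrink slowly
    else:
        target = current
    return max(min_instances, min(max_instances, target))
```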

A run-down of basic techniques for avoiding and dealing with human error. I like this article for a couple of choice quotes, such as: “human error scales up” — as your infrastructure grows bigger, the scope of potential damage from a single error also grows bigger.

The latest in Mathias Lafeldt’s Production Ready series is this article about complexity.

The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.

Outages

SRE Weekly Issue #25

Articles

This blows my mind. Chef held a live, public retrospective meeting for a recent production incident. I love this idea and can only hope that more companies follow suit. The transparency is great, but even better is that they shared their retrospective process itself: a well-defined format for retrospectives, including a statement of blamelessness at the beginning. Kudos to Chef for this, and thanks to Nell Shamrell-Harrington for posting the link on Hangops.

The actual incident was fairly interesting too. The crux of it comes down to this quote that we’ve probably all uttered ourselves at one point or another:

The further distant staging is from production, the more likely we are to introduce a bug.

PagerDuty has this explanation of alert fatigue and some tips on preventing it. One thing they missed in their list of impacts of alert fatigue: employee attrition, which directly impacts reliability.

For the network-heads out there, here’s an article on how to set up Anycast routing.

As we become more dependent on our mobile phones, the FCC is gathering information on provider outages. I, for one, wouldn’t be able to call 911 (emergency services) if AT&T had an outage, because I don’t have a land line.

I love this article if only for its title. It’s short, but its thesis bears considering: all the procedure documentation in the world won’t help you if you can’t find it during an incident, or it can’t practically be followed.

The only procedure that is worth a damn is one that has been successfully followed in the heat of battle.

So when legacy vendors suggest that the Salesforce outage calls cloud into question, they tend to ignore the fact that their own systems suffer regular outages. They just rely on the fact that few people know about them.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

An introduction to the application of formal mathematical verification to network configurations. A good overview, but I wish it went into more practical detail.

[…] a software application designer might want to know that her code will never crash or that it will never execute certain functions without an authorized login. These are simple, practical questions – but answering them is computationally challenging because of the enormous number of possible ways code may be executed, […]
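For a feel of how this works, here’s a toy example (my own, not from the article) using the z3 solver: encode a tiny “configuration” as constraints, then ask the solver to search for a counterexample to the property you care about. Real network verification tools do the same thing with forwarding tables and ACLs, just at a much larger scale.

```python
# pip install z3-solver
from z3 import Int, Solver, And, Not, sat

dst_port = Int("dst_port")

# Toy "configuration": the firewall is supposed to drop telnet (port 23)
# and forward everything else on ports 1-65535.
forwarded = And(dst_port >= 1, dst_port <= 65535, Not(dst_port == 23))

# Property to verify: no telnet packet is ever forwarded.
# The solver checks it by searching for a counterexample.
s = Solver()
s.add(And(dst_port == 23, forwarded))

if s.check() == sat:
    print("property violated, e.g. packet:", s.model())
else:
    print("no counterexample exists; the property holds")
```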

Earlier this year, I featured a story about Pinboard.in and IFTTT. IFTTT released this official apology and explanation of the problems Pinboard.in’s author outlined, and they (unofficially) promised to retain support through the end of 2016. Pinboard.in is an integral part of how I produce SRE Weekly every week, so I’m glad to see that this turned out for the best.

This article is more on the theoretical side than practical, and it’s a really interesting read. It’s the second in a series, but I recommend reading both at once (or skipping the first).

A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole.

Outages

  • Twitter
  • NS1
    • NS1’s CEO posted this incredibly detailed and excellent postmortem on the sophisticated DDoS attacks they suffered.

  • Pirate Bay
  • WhatsApp
  • Virginia (US state) government network
  • Walmart MoneyCard
  • Telstra
    • Telstra has had a hell of a time this year. This week social media and news were on fire with this days-long Telstra outage. This time, they’re offering customers a $25 credit instead of a free data day. Click through for Telstra’s explanation of what went wrong.

  • GitLab
    • Linked is their post-incident analysis.

  • Kimbia (May 3)
    • A couple weeks ago, Kimbia, a company that helps non-profits raise funds, suffered a massive failure. This occurred during Give Local America, a huge fundraising day for thousands of non-profits in the US, with the result that many organizations had a hard time accepting donations.

SRE Weekly Issue #24

My favorite read this week was this first article. It’s long, but it’s well worth a full read.

Articles

Got customers begging to throw money at you if only you’d let them run your SaaS in-house? John Vincent suggests you think twice before going down that road. This isn’t just a garden-variety opinion piece. Clearly John is drawing on extensive experience as he closely examines all of the many pitfalls in trying to convert a service into a reliable, sustainable, supportable on-premises product.

An old but excellent postmortem for an incident stemming from accidental termination of a MySQL cluster.

Thanks to logikal on hangops #incident_response for this one.

Earlier this year, Linden Lab had to do an emergency grid roll on a Friday to patch the GHOST (glibc) vulnerability. April Linden (featured here previously) shares a bit on why it was necessary and how Linden handled GHOST.

This article may be about a medication error, but this could have come straight from a service outage post-analysis:

For example, if the system makes it time consuming and difficult to complete safety steps, it is more likely that staff will skip these steps in an effort to meet productivity goals.

Having a standard incident response process is crucial. When we fail to follow it, incidents can escalate rapidly. In the case of this story from South Africa, the article alleges that the Incident Commander led a team into the fire, rather than staying outside to coordinate safety.

I believe that mistakes during incident response in my job don’t lead directly to deaths now, but how soon before they do? And are my errors perhaps causing deaths indirectly even now? (Hat-tip to Courtney E. for that line of thinking.)

Salesforce published a root cause analysis for the outage last week.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

Earlier this year, Stack Exchange suffered a short outage during a migration. The underlying issue seems to have been that they couldn’t truly test the migration, because the production environment (CDN and all) couldn’t be replicated in development.

Outages

  • NBA 2K16
  • Westpac (AU bank)
  • iiNet (AU ISP)
  • WhatsApp
  • Iraq
    • Iraq purportedly shut down its internet access (removed its BGP announcements) to prevent students from cheating on exams.

  • Virgin Mobile
    • They offered users a data credit immediately.

  • Telstra
    • Telstra had a long outage this week. They claim that the outage was caused by vandalism in Taree.

  • Datadog
    • Thanks to acabrera on hangops #incident_response for this one.

  • Mailgun
  • Disney Ticketing
    • Disney’s ticketing site suffered under an onslaught of traffic this week brought on by their free dining deal program. For reference: we had a heck of a time making our dining reservations.

SRE Weekly Issue #23

Articles

Here’s the talk on Heroku’s SRE model that fellow SRE Courtney Eckhardt and I gave at SRECon16 in April. Heroku uses a “Total Ownership” model for service operations, meaning that individual development teams are responsible for running and maintaining the services that they deploy. This in turn allows SRE to broaden our scope of responsibility to cover a wide range of factors that might impact reliability.

Full disclosure: Heroku, my employer, is mentioned.

RushCard is a prepaid debit card system, and last year they had an outage that lasted for two weeks. As part of a settlement, RushCard will pay affected customers $100 – $500 for their troubles.

Many RushCard customers are low-income minority Americans who don’t have traditional bank accounts. Without access to their money stored on their RushCards, some customers told The Associated Press at the time they could not buy food for their children, pay bills, or pay for gas to get to their jobs.

This article in Brigham and Women’s Hospital’s Safety Matters series highlights the importance of encouraging reporting of safety incidents and a blameless culture. Two excellent case studies involving medication errors are examined.

In early 2015, a fire occurred in the Channel Tunnel. Click through for a summary of the recently-released post-incident analysis. It includes the multiple complicating factors that made this into a major incident plus lots of remediations — my favorite kind of report.

SignalFx shares their in-depth experience with Kafka in this article. This reminds me of moving around ElasticSearch indices:

Although Kafka currently can do quota-based rate limiting for producing and consuming, that’s not applicable to partition movement. Kafka doesn’t have a concept of rate limiting during partition movement. If we try to migrate many partitions, each with a lot of data, it can easily saturate our network. So trying to go as fast as possible can cause migrations to take a very long time and increase the risk of message loss.
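Without built-in rate limiting for partition movement, the workaround boils down to moving partitions in small batches and waiting for each batch to settle. Here’s a minimal Python sketch of that idea; the helper functions are hypothetical stand-ins for Kafka’s reassignment tooling and your own cluster metrics, not a real Kafka API.

```python
import time

def submit_reassignment(partitions):
    """Placeholder: generate and submit a reassignment plan for this batch."""

def reassignment_in_progress():
    """Placeholder: poll the cluster's reassignment status."""
    return False

def migrate_in_batches(all_partitions, batch_size=5, poll_interval=30):
    """Move a few partitions at a time instead of all at once, so the
    migration's replication traffic can't saturate the network."""
    for i in range(0, len(all_partitions), batch_size):
        batch = all_partitions[i:i + batch_size]
        submit_reassignment(batch)
        while reassignment_in_progress():
            time.sleep(poll_interval)
```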

Plagued by pages requiring tedious maintenance of a Golang process, this developer sought to make the service self-healing.

For the Java crowd, Oracle published this simple guide on writing and deploying highly available Java EE apps using Docker. Sort of. Their example uses a single Nginx container for load balancing.

Outages

SRE Weekly Issue #22

Articles

Landon McDowell, my (incredibly awesome) former boss at Linden Lab, wrote this article in 2014 detailing a spate of bad luck and outages they’d suffered. Causes included hardware failures, DDoS, and an integer DB column hitting its maximum value.

I worked on testing the new class of database hardware mentioned in the previous article. In order to be sure the new hardware could handle our specific query pattern, I captured and replayed production queries in real time using an open source tool written years earlier at Linden Lab called Apiary. This simple but powerful concept (capture and replay) was first introduced to me by one of Apiary’s co-authors, Charity Majors. I’ve since hacked a ton on Apiary and used it at two subsequent jobs.
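If you’ve never built one of these, the core of the replay side is surprisingly small. Here’s a minimal Python sketch of the concept (my own illustration, not Apiary’s actual code): the trick is preserving the relative timing of the captured queries so the candidate hardware sees a realistic load pattern. The helpers are hypothetical placeholders for the capture source and the test database connection.

```python
import time

def captured_queries():
    """Yield (offset_seconds, sql) pairs in the order they were captured."""
    yield from []

def execute_on_test_db(sql):
    """Placeholder: run the query against the candidate database."""

def replay(speed=1.0):
    """Replay captured queries, preserving their original relative timing."""
    start = time.monotonic()
    for offset, sql in captured_queries():
        # Wait until this query is "due" relative to the start of the replay.
        due = start + offset / speed
        delay = due - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        execute_on_test_db(sql)
```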

A group calling themselves the Armada Collective has been making DDoS extortion threats to many companies recently. Cloudflare called them out as entirely toothless, with no actual attacks, but apparently some companies have paid anyway.

An excellent deep dive into a performance issue (which really equals a reliability issue), including some good lessons learned.

This is specifically referring to disaster scenarios such as hurricanes, but the general idea of a “resiliency cooperative” intrigues me.

A review of the Fire and Emergency Services response found flaws in the actions and procedures of the incident commander, who was the active fire chief at the time. The NTSB said the commander had no training on the incident management system that would have prepared him to better command the response.

Mathias Lafeldt goes deeper into chaos engineering in this latest installment of his series. He also introduces his Dockerized version of Netflix’s Chaos Monkey and shows how to automate chaos experiments to gain further confidence in your infrastructure’s reliability.
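Independent of Chaos Monkey itself, here’s a minimal sketch of what an automated chaos experiment can look like: verify the service is healthy, kill a random instance, and verify it recovers. The container label and health-check URL are assumptions for illustration, not anything from the article.

```python
import random
import subprocess
import time
import urllib.request

def kill_random_instance(label="app=myservice"):
    """Pick one running container for the service and kill it."""
    out = subprocess.run(
        ["docker", "ps", "-q", "--filter", f"label={label}"],
        capture_output=True, text=True, check=True).stdout.split()
    victim = random.choice(out)
    subprocess.run(["docker", "kill", victim], check=True)
    return victim

def service_is_healthy(url="http://localhost:8080/health", timeout=5):
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:
        return False

def run_experiment():
    assert service_is_healthy(), "refusing to run chaos on an unhealthy service"
    victim = kill_random_instance()
    time.sleep(10)  # give the orchestrator a chance to recover
    assert service_is_healthy(), f"service did not survive losing {victim}"
```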

A great overview of the difficulties inherent in anomaly detection and alerting. Note that this article is written by OpsClarity and the end reads a bit like an ad for their service.
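To see why it’s hard, it helps to look at the simplest possible detector: a rolling mean plus a standard-deviation threshold. Here’s a minimal Python sketch of that naive baseline (my own, not OpsClarity’s approach); seasonality, trends, and shifting baselines are exactly where something this simple falls apart.

```python
import math
from collections import deque

class RollingAnomalyDetector:
    """Flag a point as anomalous if it is more than `threshold` standard
    deviations away from the rolling mean of the recent window."""

    def __init__(self, window=60, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, value):
        anomalous = False
        if len(self.window) >= 10:  # need some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var)
            anomalous = std > 0 and abs(value - mean) > self.threshold * std
        self.window.append(value)
        return anomalous
```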

I’m not sure exactly what it is they’re offering now that they weren’t before, but this seems important. I think.

Outages

A production of Tinker Tinker Tinker, LLC