
SRE Weekly Issue #91

I’m heading to New York tomorrow and will be at Velocity Tuesday and Wednesday. If you’re there, look for the weirdo in the SRE Weekly shirt and hit me up for some nifty swag! Also, maybe check out my talk on DNS, if you’re into that kind of thing.

Thanks to an eagle-eyed reader for pointing out that I totally screwed up the HTML on the link last week. Oops.

SPONSOR MESSAGE

Like DevOps? Register for All Day DevOps – a FREE online conference this October, offering 100 DevOps-focused sessions across six different tracks. Learn more & register:
http://bit.ly/2waBukw

Articles

Here’s how Hosted Graphite made their job ad for an SRE-like role (Ops Automation Engineer) more inclusive. The article is filled with specific before/after language snippets, each with a detailed explanation of why they made the change.

A couple weeks after their major outage last October, Dyn published this article explaining secondary DNS. It’s a great primer and digs into what to do if you use advanced non-standard functionality like ALIAS records and traffic balancing.
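
The article itself is prose, but as a rough illustration of the kind of consistency check you want once a secondary provider is in the mix, here’s a minimal sketch (assuming dnspython 2.x; the hostname and nameserver IPs are placeholders) that asks both providers the same question and flags any disagreement, which is exactly where ALIAS-style records tend to bite:

    # Sketch only: compare answers from a primary and a secondary DNS provider.
    # Assumes dnspython >= 2.0; the hostname and nameserver IPs are placeholders.
    import dns.resolver

    NAME = "www.example.com"
    PROVIDERS = {
        "primary": "198.51.100.1",   # hypothetical primary provider nameserver
        "secondary": "203.0.113.1",  # hypothetical secondary provider nameserver
    }

    def answers(nameserver):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [nameserver]
        return {rr.to_text() for rr in resolver.resolve(NAME, "A")}

    results = {label: answers(ns) for label, ns in PROVIDERS.items()}
    if results["primary"] != results["secondary"]:
        # Provider-specific ALIAS records aren't part of standard zone transfers,
        # so a mismatch like this is worth an alert.
        print(f"Mismatch for {NAME}: {results}")
    else:
        print(f"{NAME} is consistent across providers: {results['primary']}")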

SignalFx goes into deep detail on their feature for predicting future metric values. We get an explanation of why prediction is difficult and a discussion of the math involved in their solution.

Payments: we really have to get them right. Here’s Dropbox’s Jessica Fisher with a discussion of how they reconcile failed payments.

No matter what goes wrong, our top priority is to make sure that customers receive service for which they’ve been charged, and aren’t charged for service they haven’t received.

A couple of weeks ago, I linked to a story about Resilience4j, a fault tolerance library for Java. This week brings the second installment, which shows you how to use it to implement circuit breakers. There’s also an interesting discussion of one of the implementation details.
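
If circuit breakers are new to you, here’s a bare-bones sketch of the pattern itself, in Python rather than Java and deliberately not Resilience4j’s API: stop calling a failing dependency once errors cross a threshold, then let a single probe call through after a cool-down.

    # Minimal circuit breaker sketch: the pattern, not Resilience4j's API.
    import time

    class CircuitBreaker:
        def __init__(self, max_failures=5, reset_timeout=30.0):
            self.max_failures = max_failures    # consecutive failures before opening
            self.reset_timeout = reset_timeout  # seconds to wait before a probe call
            self.failures = 0
            self.opened_at = None               # None means the circuit is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: failing fast")
                # Half-open: allow one probe call to see if the dependency recovered.
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()  # open (or re-open) the circuit
                raise
            else:
                self.failures = 0
                self.opened_at = None
                return result

Real libraries layer sliding windows, per-exception policies, and metrics on top of this core idea, which is where the implementation details get interesting.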

Here’s a cute little debugging story. Turns out ntpd has a bit of a blind spot!

Adcash CTO Arnaud Granal gives us a rare glimpse into the multiple iterations of their infrastructure. Hear what worked well and what didn’t as they scaled to handle 500k requests per second at peak.

Outages

  • OpenSRS (DNS provider)
    • OpenSRS (registrar and DNS provider, among other services) had a major outage in their DNS service.

      At 1AM UTC we were the target of a sophisticated DNS attack that was followed by an unrelated double failure of core network equipment at our main Canadian data center, caused by an undocumented software limitation.

      Yikes.

  • Amadeus (airline booking system)
    • Amadeus provides the technical underpinnings of many airlines around the world. They had issues this past week, taking a lot of airlines down with them.
  • SourceForge
    • Our [data center] hosting provider has been having issues with a power distribution unit.

  • Facebook

SRE Weekly Issue #90

A couple of DNS-related links this week.  I’ll be giving a talk at Velocity NYC on all of the fascinating things I learned about DNS in the wake of the Dyn DDoS and the .io TLD outage last fall.  If you’re there, hit me up for some SRE Weekly swag!

SPONSOR MESSAGE

Like DevOps? Register for All Day DevOps – a FREE online conference this October, offering 100 DevOps-focused sessions across six different tracks. Learn more & register:
http://bit.ly/2waBukw

Articles

We’re all becoming distributed systems engineers, and this stuff sure isn’t easy.

Isn’t distributed programming just concurrent programming where some of the threads happen to execute on different machines? Tempting, but no.

Every-second canarying is a pretty awesome concept. Not only that, but they even post the results on their status page. Impressive!
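
For a concrete sense of what that means, here’s a minimal sketch of a once-per-second canary probe (the endpoint URL and the use of the requests library are my assumptions; the real implementation surely does far more):

    # Minimal every-second canary sketch; the endpoint URL is a placeholder.
    import time
    import requests

    CANARY_URL = "https://example.com/healthz"  # hypothetical canary endpoint
    results = []                                # (timestamp, ok, latency_seconds)

    while True:
        started = time.monotonic()
        try:
            ok = requests.get(CANARY_URL, timeout=1).status_code == 200
        except requests.RequestException:
            ok = False
        results.append((time.time(), ok, time.monotonic() - started))
        # Sleep out the remainder of the one-second interval.
        time.sleep(max(0.0, 1.0 - (time.monotonic() - started)))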

So many lessons! My favorite is to make sure you test the “sad path”, as opposed to just the “happy path”. If a customer screws up their input and then continues on correctly from there, does everything still work?

Extensive notes taken during 19 talks at SREcon17 EMEA. I’m blown away by the level of detail. Thanks, Aaron!

A cheat sheet and tool list for diagnosing CPU-related issues. There’s also one on network troubleshooting by the same author. Note: LinkedIn login required to view.

Antifragility is an interesting concept that I was previously unaware of. I’m not really sure how to apply it practically in an infrastructure design, but I’m going to keep my eye out for antifragile patterns.

It’s easy to overlook your DNS, but a failure can take your otherwise perfectly running infrastructure down — at least from the perspective of your customers.

Do you run a retrospective on near misses? The screws they tightened in this story could just as easily be databases quietly running at max capacity.

A piece of one of the venting systems fell and almost hit an employee, which almost certainly would have caused a serious injury and possibly death. The business determined that (essentially) a screw came loose, causing the part to fall. It then checked the remaining venting systems, learned that other screws had started coming loose as well, and was able to resolve the issue before anyone got hurt.

Oh look, Azure has AZs now.

The transport layer in question is gRPC, and this article discusses using it to connect a microservice-based infrastructure. If you’ve been looking for an intro to gRPC, check this out.

How do you prevent human error? Remove the humans. Yeah, I’m not sure I believe it either, but this was still an interesting read just to learn about the current state of lights-out datacenters.

This is a really neat idea: generate an interaction diagram automatically using a packet capture and a UML tool.

Thanks to DevOps Weekly for this one.
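
To make the idea concrete, here’s a rough sketch of how such a script might look, assuming scapy for parsing the capture and PlantUML for rendering (the pcap filename is a placeholder):

    # Sketch: emit a PlantUML sequence diagram from a packet capture.
    # Assumes scapy is installed; "capture.pcap" is a placeholder filename.
    from scapy.all import IP, TCP, rdpcap

    lines = ["@startuml"]
    seen = set()
    for pkt in rdpcap("capture.pcap"):
        if IP in pkt and TCP in pkt:
            edge = (pkt[IP].src, pkt[IP].dst, pkt[TCP].dport)
            if edge not in seen:  # one arrow per (source, destination, port)
                seen.add(edge)
                lines.append(f'"{edge[0]}" -> "{edge[1]}": tcp/{edge[2]}')
    lines.append("@enduml")
    print("\n".join(lines))  # pipe this into plantuml to render the diagram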

Outages

  • .io
    • The .io TLD went down again, in exactly the same way as last fall.
  • PagerDuty
    • PagerDuty suffered a major outage lasting over 12 hours this past Thursday. Customers scrambled to come up with other alerting methods.
      Some really excellent discussion around this incident happened on the hangops Slack in the #incident_response channel. I and others requested more details on the actual paging latency, and PagerDuty delivered them on their status site. Way to go, folks!
  • StatusPage.io
    • I noticed this minor incident after getting a 500 reloading PagerDuty’s status page.
  • The Travis CI Blog: Sept 6 – 11 macOS outage postmortem
    • This week, Travis posted this followup describing the SAN performance issues that impacted their system.
  • Outlook and Hotmail

SRE Weekly Issue #89

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

Cachet looks like a pretty good challenger to incumbents like StatusPage.

Hosted Graphite used PySyncObj to create a fault-tolerant threshold alerting feature.
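
The article covers their actual design; purely to illustrate what PySyncObj buys you, here’s a minimal sketch of replicated state that could back a “have we already alerted on this threshold?” check (class and method names are mine, not Hosted Graphite’s):

    # Minimal PySyncObj sketch; illustrative only, not Hosted Graphite's design.
    from pysyncobj import SyncObj, replicated

    class AlertState(SyncObj):
        def __init__(self, self_addr, partner_addrs):
            # e.g. AlertState("node1:4321", ["node2:4321", "node3:4321"])
            super().__init__(self_addr, partner_addrs)
            self._fired = set()  # alert IDs we have already paged on

        @replicated
        def mark_fired(self, alert_id):
            # Applied on every node once the Raft log commits it, so all
            # replicas agree on which alerts have already been sent.
            self._fired.add(alert_id)

        def already_fired(self, alert_id):
            return alert_id in self._fired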

Talk about a high-pressure incident! When a teleconferencing provider’s wires got crossed, hilarity (and embarrassment) ensued.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This story is from a PagerDuty engineer. What’d you learn while shadowing on-call? I’d love to hear your story!

Here’s how SYNQ set their status page up. They’re the folks that committed to publishing all of their incident followups publicly a month or two back. Transparency FTW!

I’ll save you the math: that’s ~17k req/sec. I really like that this article takes us through their learning process and their first failed attempts.

Quid wrote up this explanation of how they set up their game day and what they learned. I really like the structure they used, and I may draw heavily on it for my own game days.

“Observability” as a term is making the rounds like “DevOps” did (and still does…). Here’s Baron Schwartz’s take on it.

Outages

  • Google Services
    • As two astute readers pointed out (thanks!), the Gmail outage I included in the last issue was from 2009(!). Oops. However, Google has been experiencing a series of outages and degradations this month, so I’m just going to pretend I knew that rather than that I forgot to check the date on the article.
  • Amazon S3
    • S3 had an outage in us-east-1 on September 14th. This one showed up as yellow on their status site, with the text below. Companies that depend on S3 probably saw impact as well, but I couldn’t find any status posts other than Heroku’s.

      11:58 AM PDT We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.
      12:20 PM PDT We can confirm that some customers are receiving throttling errors accessing S3. We are currently investigating the root cause.
      12:38 PM PDT We continue to work towards resolving the increased throttling errors for Amazon S3 requests in the US-EAST-1 Region. We have identified the subsystem responsible for the errors, identified root cause and are now working to resolve the issue.
      12:49 PM PDT We are now seeing recovery in the throttle error rates accessing Amazon S3. We have identified the root cause and have taken actions to prevent recurrence.
      1:05 PM PDT Between 11:40 AM and 12:56 PM PDT we experienced throttling errors accessing Amazon S3 in the US-EAST-1 Region. The issue is resolved and the service is operating normally.

      Full disclosure: Heroku is my employer.

  • IBM
    • IBM had a mishap when transferring control of some of its domains to a different registrar. Some of their services, including their Global Load Balancer, went down.

SRE Weekly Issue #88

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

From Caitie McCaffrey:

I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.

Julia Evans just blew my mind (once again). In this article, among other things, she links to a tool that tells you which function in the kernel dropped a packet. I’ve been wishing for such a tool for years!

I love that companies are starting to publish lessons learned from game days and other chaos experiments. Just like a post-incident followup, there’s so much we can learn by following along.

It’s an absolute must for any disaster recovery plan worth its name to include power supply as a crucial factor – because, without power, you simply can’t do business.

Here’s the last installment of Jason Hand’s digest version of his new eBook, Post-Incident Reviews.

If I leave you with one take-away from this guide, it should be that every incident provides an opportunity for your team to be more prepared for the next one.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How can you prevent a colo failure? Obviously, colo customers can’t, but we can at least prepare. This article has advice for understanding a provider’s history, policies, and procedures related to outages.

Just click through.

In this analysis of the factors leading to a plane crash, we see another example of the critical role that human/computer interfaces play in helping (or hindering) humans as they try to recover from a system failure.

Move over, backhoes: water is the other natural enemy of the fiber optic network.

The New York Times has a Kafka installation containing everything they’ve published in their entire history, and it powers the front page, search, suggestions, and everything else.

Outages

  • AbeBooks.com
    • AbeBooks is the place to go for out-of-print books and old editions. The site going down meant that many used booksellers lost a major sales outlet.
  • Gmail
  • Apple developer portal
  • Google Drive
  • iCloud Mail
  • Heroku
    • Heroku posted a pile of public followups this past week:
      • Incidents 1251 and 1254 – In both of these incidents, applications failed due to missing Debian packages normally provided by the Heroku platform.
      • Incident 1257 – For a few minutes, 10% of requests to Heroku applications hosted in Europe failed.
      • Incident 1270 – Applications last deployed over 3 years ago spontaneously stopped working.

      Full disclosure: Heroku is my employer.

SRE Weekly Issue #87

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

John Allspaw describes the Architecture Review Working Group at Etsy. I like the idea of an open discussion with peers before creating a novel system that will add significant operational burden.

Here’s part two of Jason Hand’s series of posts with key takeaways from his new eBook, “Post-Incident Reviews”. In the next three chapters, he shows why a traditional RCA process misses the mark.

[…] problems stem — not from one primary cause — but from the complex interplay of our systems and the teams tasked with managing them.

Honeycomb.io eschews plain monitoring in favor of “observability”, which they define as the ability to “ask any arbitrary question” about a system.

But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers… the majority of your questions trend towards the unknown-unknown.

Here’s another primer on microservices. It has a nice “caveats” section, which is exactly where operations and reliability come into the picture.

Facebook shared a lot of detail about how they evolved from 3 daily pushes to quasi-continuous releases. They’ve got a well-defined canary system, reminding me of Charity’s article on testing in production last week.

AppDynamics presents their list in shiny PDF form. You’ll have to fill in your contact info (spam-bucket email address, anyone?) to download it.

PagerDuty is hosting a “breakathon”: small teams will compete to resolve a series of infrastructure issues. Sounds like a bunch of fun!

Outages

  • Japan
    • Google accidentally announced some BGP prefixes it shouldn’t have, taking Japan offline for a couple of hours. Linked above is a really neat in-depth analysis from BGPmon, for all you BGP geeks out there.

      Since Google essentially leaked a full table towards Verizon, we get to peek into what Google’s peering relationships look like and how their peers traffic engineer towards Google.

  • Heroku
  • AWS
    • EC2’s Ireland region suffered an outage in VPC peering on August 23. Their status site doesn’t allow for deep links, so here’s an excerpt:

      11:32 AM PDT We are investigating network connectivity issues for some instances in the EU-WEST-1 Region.

      11:55 AM PDT We have identified root cause of the network connectivity issues in the EU-WEST-1 Region. Connectivity between peered VPCs is affected by this issue. Connectivity between instances within a VPC or between instances and the Internet or AWS services is not affected. We continue to work towards full recovery.

      12:51 PM PDT Between 10:32 AM and 12:44 PM PDT we experienced connectivity issues when using VPC peering in the EU-WEST-1 Region. Connectivity between instances in the same VPC and from instances to the Internet or AWS services was not affected. The issue has been resolved and the service is operating normally.

  • Google Cloud
    • Google Cloud suffered a massive 30-hour worldwide outage in some cloud load-balancers. In their impressive style, they posted frequent updates during the incident and issued a followup analysis just 2 days after resolution.

      In order to prevent the issue, Google engineers are working to enhance automated canary testing that simulates live-migration events, detection of load balancing packet loss, and enforce more restrictions on new configuration changes deployment for internal representation changes.

  • WhatsApp
  • Twitch (video streaming service)