General

SRE Weekly Issue #56

SPONSOR MESSAGE

It’s time to fix your incident management. Built for DevOps, VictorOps helps you respond to incidents faster and more effectively. Try it out for free.

Articles

If you have a minute (it’ll only take one!), would you please fill out this survey? Gabe Abinante (featured here previously) is gathering information about the on-call experience with an eye toward presenting it at Monitorama.

Wow, what a resource! As the URL says, this is “some ops for devs info”. Tons of links to useful background for developers that are starting to learn how to do operations. Thanks to the author for the link to SRE Weekly!

AWS Lambda response time can increase sharply if your function is accessed infrequently. I love the graphs in this post.
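
If you want to see the effect on your own functions, here’s a rough sketch (mine, not the article’s; the function name, region, and idle interval are placeholders) that times invocations with boto3 before and after a long idle gap:

import time

import boto3

lam = boto3.client("lambda", region_name="us-east-1")  # placeholder region

def timed_invoke(function_name):
    start = time.monotonic()
    lam.invoke(FunctionName=function_name, Payload=b"{}")
    return (time.monotonic() - start) * 1000  # milliseconds

if __name__ == "__main__":
    fn = "my-rarely-used-function"  # hypothetical function name
    # Back-to-back calls hit a warm container; after a long idle period the
    # next call often pays the container start-up ("cold start") penalty.
    print("warm:", [round(timed_invoke(fn)) for _ in range(3)], "ms")
    time.sleep(30 * 60)  # idle long enough for the container to be reclaimed
    print("after idle:", round(timed_invoke(fn)), "ms")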

A top-notch article on how to avoid common load-testing pitfalls. Great for SREs as well as developers!

A description of an investigation into poor performance in a service with an SLA of 100% of requests served in under 5ms.

Docker posted this article on how they designed InfraKit for high availability.

No!!

A blanket block of ICMP on your network device breaks important features such as ping, traceroute, and path MTU discovery. Path MTU discovery (the ICMP “Fragmentation Needed” message) is especially important: ignoring it can cause connections to appear to time out for no obvious reason.

Outages

SRE Weekly Issue #55

SPONSOR MESSAGE

It’s time to fix your incident management. Built for DevOps, VictorOps helps you respond to incidents faster and more effectively. Try it out for free.

Articles

Nothing is worse than finding out the hard way that your confidence in your backup strategy was ill-founded. Facebook prevents this with what is, in retrospect, a blatantly obvious idea that I never thought of: continuously and automatically testing your backups by trying to restore them.
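
As a thought experiment, here’s what the smallest version of that idea might look like (a hypothetical sketch, not Facebook’s system: the backup directory, scratch database, and table name are all made up, and it assumes pg_dump-format dumps restored with pg_restore and checked via psycopg2):

import glob
import os
import subprocess

import psycopg2

BACKUP_DIR = "/backups/daily"          # hypothetical dump location
SCRATCH_DSN = "dbname=restore_check"   # throwaway database used only for verification

def latest_backup():
    dumps = sorted(glob.glob(os.path.join(BACKUP_DIR, "*.dump")))
    if not dumps:
        raise RuntimeError("no backups found -- a finding in itself")
    return dumps[-1]

def restore_and_verify():
    # Rebuild the scratch database from the newest dump.
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         "--dbname=restore_check", latest_backup()],
        check=True,
    )
    # Sanity-check that the restored data looks plausibly complete.
    with psycopg2.connect(SCRATCH_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM users")  # hypothetical table
        (count,) = cur.fetchone()
        if count == 0:
            raise RuntimeError("restored users table is empty")

if __name__ == "__main__":
    restore_and_verify()
    print("backup restore check passed")

Run something like this on a schedule and page when it fails, and you’ve caught the ill-founded-confidence problem long before you need the backup for real.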

Route 53 can do failover based on health checks, but it doesn’t know how to check if a database is healthy. This article discusses using an HTTP endpoint that checks the status of the DB and returns status 200 or 500 depending on whether the DB is up. There’s also a discussion of how to handle failure of the HTTP endpoint itself.
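
For a concrete picture, here’s a minimal sketch of such an endpoint (my illustration, assuming Flask and psycopg2 with a placeholder DSN; the article’s stack may differ). Route 53’s health check hits /health, and the response code tracks whether a trivial query against the database succeeds:

from flask import Flask
import psycopg2

app = Flask(__name__)
DB_DSN = "host=db.internal dbname=app connect_timeout=2"  # hypothetical DSN

@app.route("/health")
def health():
    try:
        with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
        return "ok", 200
    except Exception:
        # Any failure -- timeout, auth error, failover in progress -- reads as unhealthy.
        return "database unreachable", 500

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Run the endpoint on more than one host so that its own failure doesn’t masquerade as a database failure, which is exactly the second problem the article digs into.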

Chaos Monkey was designed with the idea of having it run all the time on a schedule, but as Mathias Lafeldt shares, you can also (or even exclusively) trigger failures through an API. He even wrote a CLI for the API.
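
For flavor, an on-demand termination request might look something like this (a hedged sketch: the endpoint path and payload follow my reading of the Simian Army REST docs, and the host, port, and group name are placeholders, so check your deployment and Mathias’s CLI before relying on it):

import requests

CHAOS_URL = "http://chaosmonkey.internal:8080/simianarmy/api/v1/chaos"  # placeholder host/port

def terminate_random_instance(asg_name):
    event = {
        "eventType": "CHAOS_TERMINATION",
        "groupType": "ASG",
        "groupName": asg_name,
    }
    resp = requests.post(CHAOS_URL, json=event, timeout=10)
    resp.raise_for_status()
    return resp.json()  # the recorded chaos event, including the chosen victim

if __name__ == "__main__":
    print(terminate_random_instance("myapp-v042"))  # hypothetical ASG name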

Here’s a link shared with me by its author. If you write something you think other SREs will like, please don’t hesitate to send it my way! I love this article, because load testing is yet another aspect of the growing trend toward developers owning the operation of their code.

This article is short and sweet. There are four rock-bottom metrics that you really need in order to figure out whether something is wrong with your service. They had me at “Downstreamistan”.

This description of Chaos Engineering is more rigorous than casual articles, making for a pretty interesting read even if you already know all about it.

Although the term “chaos” evokes a sense of unpredictability, a fundamental assumption of chaos engineering is that complex systems exhibit behaviors regular enough to be predicted.

I haven’t had a chance to watch this yet, but the description is riveting even by itself. Click through for a link to play the documentary directly.

Outages

  • Second Life
    • One transit provider failed and automatic failover didn’t work. Once they were back up, the subsequent thundering herd of logins threatened to take them back down. Click through for a detailed post-analysis. (The usual client-side mitigation for this kind of reconnect storm is sketched just after this outage list.)
  • S3, EC2 API
    • On January 10, S3 had issues processing DELETE requests (though you wouldn’t know it from looking at the history section of their status page). Various (presumably) dependent services such as Heroku and PackageCloud.io had simultaneous outages.

      Full disclosure: Heroku is my employer.

  • Lloyds Bank
  • Mailgun
  • Battlefield 1
  • Facebook
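
On the thundering-herd point above: the standard client-side mitigation (a generic sketch, not Second Life’s actual fix) is to retry logins with capped exponential backoff plus jitter, so the reconnect storm spreads out instead of arriving all at once.

import random
import time

def login_with_backoff(attempt_login, max_retries=8, base=1.0, cap=300.0):
    # attempt_login is any callable returning True on success.
    for attempt in range(max_retries):
        if attempt_login():
            return True
        # Full jitter: sleep a random amount up to the capped exponential delay.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return False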

SRE Weekly Issue #54

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

Wow! PagerDuty made waves this week by releasing their internal incident response documentation. This is really exciting, and I’d love it if more companies did this. Their incident response procedures are detailed and obviously the result of hard-won experience. The hierarchical, almost militaristic command and control structure is intriguing and makes me wonder what problems they’re solving.

Lots of detail on New Relic’s load testing strategy, along with an interesting tidbit:

In addition, as we predicted, many sites deployed new deal sites specifically for Cyber Monday with less than average testing. Page load and JavaScript error data represented by far the largest percentage increase in traffic volume, with a 56% bump[…]

Last in the series, this article is an argument that metrics aren’t always enough. Sometimes you need to see the details of the actual events (requests, database operations, etc) that produced the high metric values, and traditional metrics solutions discard these in favor of just storing the numbers.

Let’s Encrypt has gone through a year of intense growth in usage. Their Incidents page has some nicely detailed postmortems, if you’re in the mood.

An eloquent post on striving toward a learning culture in your organization, as opposed to a blaming one, when discussing adverse incidents.

I like to include the occasional debugging deep-dive article, because it’s always good to keep our skills fresh. Here’s one from my coworker on finding the source of an unexpected git error message.

Full disclosure: Heroku, my employer, is mentioned.

Outages

SRE Weekly Issue #53

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

Without explicit limits, things fail in unexpected and unpredictable ways. Remember: the limits exist; they’re just hidden.
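
A tiny illustration of making a limit explicit (my example, not the article’s): a bounded queue rejects work loudly, instead of growing quietly until the process dies of memory exhaustion at some unpredictable moment.

import queue

work = queue.Queue(maxsize=1000)  # an explicit, deliberately chosen limit

def enqueue(job):
    try:
        work.put_nowait(job)
    except queue.Full:
        # Fail fast and visibly so callers can shed load or back off.
        raise RuntimeError("work queue full; shedding load")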

AWS gives us this in-depth explanation of their use of shuffle sharding in the Route 53 service. This is especially interesting given the Dyn DDoS attack a couple of months ago.

How does container networking work? Julia Evans points her curious mind toward this question and shares what she learned.

[…] it’s important to understand what’s going on behind the scenes, so that if something goes wrong I can debug it and fix it.

More on the subject of percentiles and incorrect math this week from Circonus. The SLA calculation stuff is especially on point.

And speaking of SLAs, here’s an excellent article on how to design and adopt an SLA in your product or service.
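
One number worth working out before you commit to an SLA is the downtime it actually allows. A quick back-of-the-envelope calculation (generic, not from the article):

MONTH_MINUTES = 30 * 24 * 60  # a 30-day month

for target in (0.99, 0.999, 0.9999):
    budget = MONTH_MINUTES * (1 - target)
    print(f"{target:.2%} availability allows {budget:.1f} minutes of downtime per month")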

A summary of a few notable Systems We Love talks. I’m so jealous of all of you folks that got to go!

PagerDuty added #OnCallSelfie support to their app. Amusingly, that first picture is of my (awesome) boss.  Hi, Joy!

A post-analysis of an Azure outage from 2012. The especially interesting thing to me is the secondary outage caused by eagerness to quickly deploy a fix to the first outage. There’s a cognitive trap here: we become overconfident when we think we’ve found The Root Cause and we rush to deploy a patch.

Outages

SRE Weekly Issue #52

Merry Decemberween, all! Much like trash pickup, SRE Weekly comes one day late when it falls on a holiday.

SPONSOR MESSAGE

The “2016/17 State of On-Call” report from VictorOps is now available to download. Learn what 800+ respondents have to say about life on-call, and steps they’re taking to make it better. Get your free copy here: https://victorops.com/state-of-on-call

Articles

Percentiles are tricky beasts. Does that graph really mean what you think it means?

The math is just broken. An average of a percentile is meaningless.

Thanks to Devops Weekly for this one.
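
A small demonstration of the point (mine, not from the article): the average of per-minute p99s can be wildly different from the true p99 computed over all requests, especially when one bad minute carries most of the traffic.

import random
import statistics

random.seed(0)

def p99(values):
    ordered = sorted(values)
    return ordered[int(0.99 * (len(ordered) - 1))]

# Fifty-nine quiet minutes of request latencies (ms), then one bad minute
# with far more traffic and far worse latency.
minutes = [[random.gauss(100, 10) for _ in range(1_000)] for _ in range(59)]
minutes.append([random.gauss(900, 50) for _ in range(50_000)])

per_minute_p99 = [p99(m) for m in minutes]
all_requests = [v for m in minutes for v in m]

print("average of per-minute p99s:", round(statistics.mean(per_minute_p99)))
print("true p99 over all requests:", round(p99(all_requests)))

The averaged number mostly hides the bad minute; the real p99 doesn’t.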

There’s that magical “human error” again.

ChangeIP suffered a major outage two weeks ago and they posted this analysis of the incident. Thanks, folks! Does this sound familiar?

We learned that when we started providing this service to the world, we made design and data layout decisions that made sense at the time but no longer do.

Shuffle sharding is a nifty technique for preventing impact from spreading to multiple users of your service. A great example is the way Route 53 assigns nameservers for hosted DNS zones:

sreweekly.com. 172800 IN NS ns-442.awsdns-55.com.
sreweekly.com. 172800 IN NS ns-894.awsdns-47.net.
sreweekly.com. 172800 IN NS ns-1048.awsdns-03.org.
sreweekly.com. 172800 IN NS ns-1678.awsdns-17.co.uk.
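
Here’s a toy version of the idea (my illustration, not AWS’s implementation): each customer gets a small, deterministic, pseudo-random subset of the fleet, so two customers rarely share all of their servers and one customer’s bad luck (or bad traffic) stays contained.

import hashlib
import random

FLEET = [f"ns-{i}" for i in range(2048)]  # stand-in for the nameserver fleet
SHARD_SIZE = 4                            # like the four NS records per zone

def shard_for(customer_id):
    # Seed a PRNG from the customer ID so the assignment is stable over time.
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return random.Random(seed).sample(FLEET, SHARD_SIZE)

print(shard_for("sreweekly.com"))
print(shard_for("example.com"))  # almost certainly a different set of four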

Fastly has a brilliant, simple, and clever solution to load balancing and connection draining using a switch ignorant of layer 4.

Incurring connection resets on upgrades has ramifications far beyond disrupting production traffic: it provides a disincentive for continuous software deployment.

Heroku shared a post-analysis of their major outage on December 15.

Full disclosure: Heroku is my employer.

Outages

  • NTP server pool
    • Load on the worldwide NTP server pool increased significantly due to a “buggy Snapchat app update”. What was Snapchat doing with NTP? (more details)
  • Zappos
    • Zappos had a cross-promotion with T-Mobile, and the traffic overloaded them. Thanks to Amanda Gilmore for this one.
  • Slack
    • Among other forms of impairment, /-commands were repeated numerous times. At $JOB, this meant that people accidentally paged their coworkers over and over until we disabled PagerDuty.
  • Librato
    • “What went well” is an important part of any post-analysis.
  • Tumblr
  • Southwest Airlines