SRE Weekly Issue #62

S3 fails, and suddenly it’s SRE “go-time” at companies everywhere! I don’t know about you, but I sure am exhausted.

SPONSOR MESSAGE

DevOps incident management at its finest. Start your free trial of VictorOps. http://try.victorops.com/sreweekly/trial

Articles

When you do as the title suggests, you realize that network partitions go from the realm of theoretical to everyday.

Asana shares their “Five Whys” process, which they use not only for outages but even for missed deadlines. This caught my eye:

Our team confidently focuses on problem mitigation while fighting a fire, knowing that there will be time for post-mortem and long-term fixes later.

Using Route 53 as a case study, AWS engineers explain how they carefully designed their deploy process to reduce impact from failed deploys.

One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact.

GitHub used a data-driven approach when migrating a storage load from Redis to MySQL. It’s a good thing they did, because a straight one-for-one translation would have overloaded MySQL.

We’ve heard before that it’s important to make sure that your alerts are actionable. I like that this article goes into some detail on why we sometimes tend to create inactionable alerts before explaining how to improve your alerting.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Ubuntu backported a security fix into Xenial’s kernel last month, and unfortunately, they introduced a regression. Under certain circumstances, the kernel will give up way too easily when attempting to find memory to satisfy an allocation and will needlessly trigger the OOM killer. A fix was released on February 20th.

Need to tell someone their perpetual motion machine (er, CAP-satisfying system) won’t work? Low on time? Use this handy checklist to explain why.

GitLab seriously considered fleeing the cloud for a datacenter, and they asked the community for feedback. That feedback was very useful and was enough to change their minds. The common theme: “you are not an infrastructure company, so why try to be one?”

If you’ve got a firehose of events going into your metrics/log aggregation system, you may need to reduce load on it by only sending in a portion of your events. Do you pick one out of every N? Honeycomb’s makers suggest an interesting alternative: tag each sampled event you send as representing N events from the source, and let N vary between samples.
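To make the idea concrete, here is a minimal Python sketch of per-event sample rates. It is not taken from the linked article or from any Honeycomb library: the names `SampledEvent`, `choose_rate`, `maybe_sample`, and `estimated_count` are hypothetical, and the policy of keeping every 5xx response while keeping roughly 1 in 100 of everything else is just an assumption for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SampledEvent:
    payload: dict
    sample_rate: int  # this one event stands in for `sample_rate` source events


def choose_rate(payload):
    # Hypothetical policy: keep every server error, ~1 in 100 of the rest.
    return 1 if payload.get("status", 200) >= 500 else 100


def maybe_sample(payload, rate, rng=random):
    """Keep the event with probability 1/rate and tag it with that rate."""
    rate = max(int(rate), 1)
    if rng.randrange(rate) == 0:
        return SampledEvent(payload, rate)
    return None  # dropped: never sent to the aggregation system


def estimated_count(events):
    """Estimate how many source events the kept samples represent."""
    return sum(e.sample_rate for e in events)
```

Because each kept event carries its own rate, the receiving side can still estimate totals by summing `sample_rate` rather than counting rows, even when rare events (errors) and common events (successes) were sampled at very different rates.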

Outages

  • Amazon S3
    • Amazon S3 in the us-east-1 region went down, taking many sites and services down with it, including Trello, Heroku, portions of Slack and GitHub, and tons more. Amazon’s status page had a note at the top but was otherwise green across the board for hours. Meanwhile nearly 100% of S3 requests failed and many other AWS services burned as well. Their outage summary (linked above) indicated that the outage uncovered a dependency of their status site on S3. Oops. Once they got that fixed a few hours later, they posted something I’ve never seen before: actual red icons. Full disclosure: Heroku is my employer.
  • Joyent: Postmortem for July 27 outage of the Manta service
    • Here’s a deeply technical post-analysis of a PostgreSQL outage that Joyent experienced in 2015. A normally benign automatic maintenance task (an auto-vacuum) turned into a total DB lockup due to their workload.
  • PagerDuty
  • GoDaddy
    • DDoS attack on their nameservers.
Updated: March 5, 2017 — 9:14 pm