SRE Weekly Issue #62

S3 fails, and suddenly it’s SRE “go-time” at companies everywhere! I don’t know about you, but I sure am exhausted.

Articles

Think of Latency as a Pseudo-permanent Network Partition

When you do as the title suggests, you realize that network partitions go from the realm of theoretical to everyday.

Ask 5 Whys to get to the root of any problem

Asana shares their “Five Whys” process, which they use not only for outages but even for missed deadlines. This caught my eye:

Our team confidently focuses on problem mitigation while fighting a fire, knowing that there will be time for post-mortem and long-term fixes later.

Organizing Software Deployments to Match Failure Conditions

Using Route 53 as a case study, AWS engineers explain how they carefully designed their deploy process to reduce impact from failed deploys.

One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact.

Moving persistent data out of Redis

GitHub used a data-driven approach when migrating a storage load from Redis to MySQL. It’s a good thing they did, because a straight one-for-one translation would have overloaded MySQL.

Actionable Alerts

We’ve heard before that it’s important to make sure that your alerts are actionable. I like that this article goes into some detail on why we sometimes tend to create inactionable alerts before explaining how to improve your alerting.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How to stop Ubuntu Xenial from randomly killing your big processes

Ubuntu backported a security fix into Xenial’s kernel last month, and unfortunately, they introduced a regression. Under certain circumstances, the kernel will give up way too easily when attempting to find memory to satisfy an allocation and will needlessly trigger the OOM killer. A fix was released on February 20th.

Beating the CAP Theorem Checklist

Need to tell someone their ~~perpetual motion machine~~ CAP-satisfying system won’t work? Low on time? Use this handy checklist to explain why their idea won’t work.

Why we are not leaving the cloud

GitLab seriously considered fleeing the cloud for a datacenter, and they asked the community for feedback. That feedback was very useful and was enough to change their minds. The common theme: “you are not an infrastructure company, so why try to be one?”

Instrumenting High Volume Services: Part 2

If you’ve got a firehose of events going into your metrics/log aggregation system, you may need to reduce load on it by only sending in a portion of your events. Do you pick one out of every N? Honeycomb’s makers suggest an interesting alternative: tag each sampled event you send as representing N events from the source, and allow N to vary between samples.
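To make the idea concrete, here is a minimal sketch of per-event sample rates in Python. It is not Honeycomb's implementation: the function names, the `sample_rate` field, and the rate-picking rule in the usage example are all assumptions made for illustration. Each kept event carries the N it was sampled at, so the receiver can estimate the original counts by summing N rather than counting events.

```python
import random
from collections import defaultdict


def sample_events(events, rate_for, rng=None):
    """Keep roughly 1 in N events, where N = rate_for(event) can differ per event.

    Every kept event is tagged with the N it was sampled at (hypothetical
    "sample_rate" field), meaning "this event stands for N source events".
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        n = max(1, int(rate_for(event)))
        # random() is in [0, 1), so N == 1 always keeps the event.
        if rng.random() < 1.0 / n:
            kept.append({**event, "sample_rate": n})
    return kept


def estimate_counts(sampled, key):
    """Estimate original per-key counts by summing each event's sample_rate."""
    totals = defaultdict(int)
    for event in sampled:
        totals[event[key]] += event["sample_rate"]
    return dict(totals)


if __name__ == "__main__":
    # Assumed policy: keep every error, but only about 1 in 100 successes.
    traffic = [{"status": 200}] * 10_000 + [{"status": 500}] * 20
    kept = sample_events(traffic, lambda e: 1 if e["status"] >= 500 else 100)
    print(len(kept), "events sent;", "estimated totals:", estimate_counts(kept, "status"))
```

Because the rate travels with each event, rare and interesting events such as errors can be kept in full while high-volume, uninteresting traffic is sampled heavily, and aggregate estimates still come out roughly right.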

Outages

Amazon S3
- Amazon S3 in the us-east-1 region went down, taking many sites and services down with it, including Trello, Heroku, portions of Slack and GitHub, and tons more. Amazon’s status page had a note at the top but was otherwise green across the board for hours. Meanwhile nearly 100% of S3 requests failed and many other AWS services burned as well. Their outage summary (linked above) indicated that the outage uncovered a dependency of their status site on S3. Oops. Once they got that fixed a few hours later, they posted something I’ve never seen before: actual red icons. Full disclosure: Heroku is my employer.
Joyent: Postmortem for July 27 outage of the Manta service
- Here’s a deeply technical post-analysis of a PostgreSQL outage that Joyent experienced in 2015. A normally benign automatic maintenance operation (an auto-vacuum) turned into a total DB lockup due to their workload.
PagerDuty
GoDaddy
- DDoS attack on their nameservers.
