SRE Weekly Issue #55

SPONSOR MESSAGE

It’s time to fix your incident management. Built for DevOps, VictorOps helps you respond to incidents faster and more effectively. Try it out for free.

Articles

Nothing is worse than finding out the hard way that your confidence in your backup strategy was ill-founded. Facebook prevents this with what is, in retrospect, a blatantly obvious idea that I never thought of: continuously and automatically testing your backups by trying to restore them.
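To make the idea concrete, here is a minimal sketch of a restore test. It is not Facebook's system; it assumes a SQLite database purely so the example is self-contained, and the function names (`take_backup`, `verify_backup`) and the row-count check are hypothetical choices for illustration.

```python
import sqlite3


def take_backup(source: sqlite3.Connection, backup_path: str) -> None:
    """Write a full copy of the source database to backup_path."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)  # sqlite3 online backup API (Python 3.7+)
    finally:
        dest.close()


def verify_backup(backup_path: str, expected_rows: dict) -> bool:
    """Restore the backup into a scratch database and sanity-check it.

    expected_rows maps table name -> expected row count (an assumed,
    deliberately simple definition of "the restore worked").
    """
    backup = sqlite3.connect(backup_path)
    restored = sqlite3.connect(":memory:")  # scratch restore target
    try:
        backup.backup(restored)  # the actual "restore" step
        ok = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        for table, count in expected_rows.items():
            query = f'SELECT COUNT(*) FROM "{table}"'
            ok = ok and restored.execute(query).fetchone()[0] == count
        return ok
    except sqlite3.DatabaseError:
        # Unreadable backup or missing table: the restore test fails.
        return False
    finally:
        backup.close()
        restored.close()
```

Run on a schedule against every new backup, a check like this turns "we think the backups are fine" into something you find out about before you need one.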

Route 53 can do failover based on health checks, but it doesn’t know how to check if a database is healthy. This article discusses using an HTTP endpoint that checks the status of the DB and returns status 200 or 500 depending on whether the DB is up. There’s also a discussion of how to handle failure of the HTTP endpoint itself.
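A rough sketch of what such an endpoint could look like follows. It is not the article's code: the `/health` path, port 8080, and the `db_is_healthy` check (which uses an in-memory SQLite connection as a stand-in for a real database) are all assumptions made for illustration.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def db_is_healthy() -> bool:
    """Hypothetical DB check: open a connection and run a trivial query.

    SQLite stands in for the real database here; swap in your own driver.
    """
    try:
        conn = sqlite3.connect(":memory:", timeout=2)
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False


def status_for(healthy: bool) -> int:
    """Map DB health to the status code the health checker will see."""
    return 200 if healthy else 500


class HealthHandler(BaseHTTPRequestHandler):
    # Class attribute so the check can be replaced (e.g. in tests).
    check = staticmethod(db_is_healthy)

    def do_GET(self):
        if self.path == "/health":  # assumed path
            code = status_for(self.check())
        else:
            code = 404
        body = {200: b"ok\n", 500: b"db down\n"}.get(code, b"not found\n")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep frequent health-check polling out of the logs


if __name__ == "__main__":
    # Assumed port; point the Route 53 health check at this host and path.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Route 53 treats 2xx and 3xx responses as healthy, so returning 500 when the query fails is enough to mark the endpoint unhealthy. Note that if this process itself dies, the health check fails as well, which is the endpoint-failure case the article discusses.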

Chaos Monkey was designed with the idea of having it run all the time on a schedule, but as Mathias Lafeldt shares, you can also (or even exclusively) trigger failures through an API. He even wrote a CLI for the API.

Here’s a link shared with me by its author. If you write something you think other SREs will like, please don’t hesitate to send it my way! I love this article, because load testing is yet another aspect of the growing trend toward developers owning the operation of their code.

This article is short and sweet. There are four rock-bottom metrics that you really need in order to figure out whether something is wrong with your service. They had me at “Downstreamistan”.

This description of Chaos Engineering is more rigorous than the typical casual article, which makes for a pretty interesting read even if you already know all about it.

Although the term “chaos” evokes a sense of unpredictability, a fundamental assumption of chaos engineering is that complex systems exhibit behaviors regular enough to be predicted.

I haven’t had a chance to watch this yet, but the description is riveting even by itself. Click through for a link to play the documentary directly.

Outages

  • Second Life
    • One transit provider failed and automatic failover didn’t work. Once they were back up, the subsequent thundering herd of logins threatened to take them back down. Click through for a detailed post-analysis.
  • S3, EC2 API
    • On January 10, S3 had issues processing DELETE requests (though you wouldn’t know it from looking at the history section of their status page). Various (presumably) dependent services such as Heroku and PackageCloud.io had simultaneous outages.

      Full disclosure: Heroku is my employer.

  • Lloyds Bank
  • Mailgun
  • Battlefield 1
  • Facebook
Updated: January 15, 2017 — 9:55 pm
A production of Tinker Tinker Tinker, LLC