SRE Weekly Issue #55

Articles

Continuous MySQL backup validation: Restoring backups

Nothing is worse than finding out the hard way that your confidence in your backup strategy was ill-founded. Facebook prevents this with what is, in retrospect, a blatantly obvious idea that I never thought of: continuously, automatically testing your backups by trying to restore them.

Running Multiple HTTP Endpoints as a Highly Available Health Proxy

Route 53 can do failover based on health checks, but it doesn’t know how to check if a database is healthy. This article discusses using an HTTP endpoint that checks the status of the DB and returns status 200 or 500 depending on whether the DB is up. There’s also a discussion of how to handle failure of the HTTP endpoint itself.
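
To make the idea concrete, here is a minimal sketch of that kind of endpoint in Python, using only the standard library. It is not the article's implementation. The names `status_for`, `make_handler`, and `serve`, the port, and the `check` callable are all assumptions for illustration; `check` stands in for whatever query or ping tells you your database is up.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def status_for(check: Callable[[], bool]) -> int:
    """Map a database check onto an HTTP status: 200 if up, 500 otherwise."""
    try:
        return 200 if check() else 500
    except Exception:
        # A check that raises (timeout, refused connection) counts as down.
        return 500


def make_handler(check: Callable[[], bool]):
    """Build a request handler class bound to a hypothetical DB check."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            code = status_for(check)
            body = b"ok\n" if code == 200 else b"db down\n"
            self.send_response(code)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Health checkers poll frequently; keep the access log quiet.
            pass

    return HealthHandler


def serve(check: Callable[[], bool], port: int = 8080) -> None:
    """Serve the health endpoint until interrupted (port is an assumption)."""
    HTTPServer(("", port), make_handler(check)).serve_forever()
```

You would call `serve(my_db_check)` with your own check function and point the health check at that port. Treating an exception as "down" is a deliberate choice here: a hung or unreachable database should fail the check rather than crash the endpoint.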

Using Chaos Monkey whenever you feel like it

Chaos Monkey was designed with the idea of having it run all the time on a schedule, but as Mathias Lafeldt shares, you can also (or even exclusively) trigger failures through an API. He even wrote a CLI for the API.

Four reasons developers should write their own load tests

Here’s a link shared with me by its author. If you write something you think other SREs will like, please don’t hesitate to send it my way! I love this article, because load testing is yet another aspect of the growing trend toward developers owning the operation of their code.

The First Four Things You Measure

This article is short and sweet. There are four rock-bottom metrics that you really need in order to figure out whether something is wrong with your service. They had me at “Downstreamistan”.

Chaos Engineering

This description of Chaos Engineering is more rigorous than the usual casual write-ups, making for a pretty interesting read even if you already know all about it.

Although the term “chaos” evokes a sense of unpredictability, a fundamental assumption of chaos engineering is that complex systems exhibit behaviors regular enough to be predicted.

A Chilling PBS Documentary Shows How Mistakes Are Made

I haven’t had a chance to watch this yet, but the description is riveting even by itself. Click through for a link to play the documentary directly.

Outages

Second Life
- One transit provider failed and automatic failover didn’t work. Once they were back up, the subsequent thundering herd of logins threatened to take them back down. Click through for a detailed post-analysis.
S3, EC2 API
- On January 10, S3 had issues processing DELETE requests (though you wouldn’t know it from looking at the history section of their status page). Various (presumably) dependent services such as Heroku and PackageCloud.io had simultaneous outages.
  Full disclosure: Heroku is my employer.
Lloyds Bank
Mailgun
Battlefield 1
Facebook
