SRE Weekly Issue #112

Articles

Spooky action at a distance, how an AWS outage ate our load balancer

an outage of a provider that we don’t use, directly or indirectly, resulted in our service becoming unavailable.

I don’t think I even need to add anything to that to make you want to read this article.

Fran Garcia — Hosted Graphite

GitHub’s report on the Memcached-based DDoS

The big story this week is the memcached UDP amplification DDoS method, used to send 1.3 Tbps (!) toward our friends at GitHub. Their description is linked above.

Sam Kottler — GitHub

The internet was alight with related discussions:

Cloudflare’s description of the attack
Akamai’s story about helping GitHub survive the attack
memcached developer announces release that disables UDP by default, with commentary
Charity Majors also had some amusing commentary
Wired’s story on the GitHub attack

Runbook Template

An excellent template that you can use as a basis for writing runbooks.

Caitie McCaffrey

DevOps and SRE Contribution – The Lemur Book

The author of an upcoming O’Reilly book is looking for small contributions for a crowd-sourced chapter:

In two paragraphs or less, what do you think is the relationship between DevOps and SRE? How are they similar? How are they different? Can both be implemented at every organization? Can the two exist in the same org at the same time? And so on…

David Blank-Edelman

Meet Bandaid, the Dropbox service proxy

Bandaid started as a reverse proxy that compensated for inefficiencies in our server-side services.

I’m intrigued by the way it handles its queue in last-in first-out order, on the theory that a request that’s been waiting for a long time is likely to be cancelled by its requester.

Dmitry Kopytkov and Patrick Lee — Dropbox
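
To make the last-in first-out idea concrete, here is a minimal sketch of a LIFO request queue. It is not Dropbox's implementation: the class name, the `max_size` and `max_wait_seconds` parameters, and the drop-oldest-when-full policy are all assumptions made for illustration.

```python
import time
from collections import deque


class LifoRequestQueue:
    """Hypothetical LIFO queue: the newest request is served first.

    Assumption for this sketch: a request that has waited longer than
    max_wait_seconds has probably been abandoned by its requester.
    """

    def __init__(self, max_size, max_wait_seconds, clock=time.monotonic):
        self._items = deque()
        self._max_size = max_size
        self._max_wait = max_wait_seconds
        self._clock = clock  # injectable so the behavior can be tested

    def push(self, request):
        # When full, shed the oldest request (assumed policy): it is the
        # one most likely to have been cancelled already.
        if len(self._items) >= self._max_size:
            self._items.popleft()
        self._items.append((self._clock(), request))

    def pop(self):
        """Return the most recently queued request, or None."""
        if not self._items:
            return None
        enqueued_at, request = self._items.pop()
        if self._clock() - enqueued_at > self._max_wait:
            # Timestamps only grow, so if the newest entry is stale,
            # every older entry is stale too.
            self._items.clear()
            return None
        return request
```

Under load, a queue like this keeps serving fresh requests quickly and lets the stale ones age out, rather than spending capacity on work nobody is waiting for.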

5 of the world’s biggest network outages

This is an amusing recap of five major outages of the past few years. If you’ve been subscribed for a while, it’ll be review, but I still enjoyed the reminder.

Michael Rabinowitz

Fail-slow at scale: When the cloud stops working

This article summarizes a new research paper on “fail-slow” hardware failures. When hardware only kind of fails, it can often have more disastrous consequences than when it stops working outright.

Robin Harris — Storage Bits

Launching An Entire Fireworks Display At Once

This is an awe-inspiring way to make a point about designing systems to be resilient to human error.

If it’s possible for a human to hit the wrong button and set off an entire fireworks display by accident, then maybe the problem isn’t with the human; it’s with that button.

If it’s possible to mix up minutes and fractions of a second like we’ve done deliberately, then maybe the system isn’t clear, or maybe the pre-launch checklist isn’t thorough enough.

Tom Scott

Managing Feature Flag Debt with Split

There are some really great ideas in this article around preventing and ameliorating the technical debt that can be inherent in the use of feature flags. Ostensibly this article is about using Split.io, but the ideas are broadly applicable.

Adil Aijaz — Split

Outages

Slack
- Possibly due to the AWS outage. Thanks to marc on hangops #incident_response for this one.
AWS us-east-1
Statuspage.io
Abebooks
LinkedIn
AOL Email
Vero (social network)
CoinsMarkets (cryptocurrency exchange)
