SRE Weekly Issue #141

SPONSOR MESSAGE

Are you exploring serverless architecture on AWS? Check out this post to get step-by-step instructions for setting up and maintaining DynamoDB to keep it from waking you up with unactionable alerts:

http://try.victorops.com/sreweekly/dynamodb-and-aws

Articles

An outline of the design of Netflix’s new load balancer, with special emphasis on dealing with faltering backends. Great idea: servers report their utilization level in a response header. Tricky pitfall: the LB is so good at moving requests off of ailing backends that backend failure rate alerts don’t fire.

Mike Smith — Netflix
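
To make the response-header idea concrete, here is a minimal sketch in Python. It is not Netflix's implementation: the header name `X-Server-Utilization`, the `Balancer` class, and the rule of picking the less-utilized of two randomly sampled backends are all assumptions chosen for illustration.

```python
import random

# Hypothetical header name; the article's actual header may differ.
UTILIZATION_HEADER = "X-Server-Utilization"


class Balancer:
    """Toy load balancer that tracks backend-reported utilization."""

    def __init__(self, backends, rng=None):
        # Assume every backend is idle until it reports otherwise.
        self.utilization = {backend: 0.0 for backend in backends}
        self.rng = rng or random.Random()

    def record_response(self, backend, headers):
        """Update the backend's utilization from a response's headers."""
        value = headers.get(UTILIZATION_HEADER)
        if value is None:
            return  # Backend did not report; keep the last known value.
        try:
            self.utilization[backend] = float(value)
        except ValueError:
            pass  # Ignore malformed values rather than failing the request.

    def choose(self):
        """Sample two backends at random and return the less utilized one."""
        backends = list(self.utilization)
        if len(backends) == 1:
            return backends[0]
        first, second = self.rng.sample(backends, 2)
        if self.utilization[first] <= self.utilization[second]:
            return first
        return second
```

A caller would invoke `record_response` after every proxied request and `choose` before the next one. Because traffic drifts away from a struggling backend on its own, the pitfall noted above still applies: alerting may need to watch the reported utilization itself, not only the backend's failure rate.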

This article begins by explaining consistency versus availability in distributed data stores and argues that the trade-off is less significant than one might think. Then it describes a pitfall seen in some new data stores. I’ve pondered aloud here in the past on how Spanner can make the claims it does, and this article explains that nicely.

Daniel Abadi

And here’s a refutation of part of the previous article by the creator of RavenDB.

Ayende Rahien

It is tempting to think that ensuring the resilience or continuity of all the individual parts of a business will guarantee the resilience or continuity of the whole.

Dr. Sandra Bell

GitHub used an innovative technique to avoid holding open a long-running code branch while upgrading their application to support Rails 5.2.

Eileen Uchitelle — GitHub

Worker node errors led to cascading failure when they hit Google Compute Engine quotas.

Bogdana Vereha — Travis CI

This week, the US Internal Revenue Service (IRS) issued a report analyzing the tax-day outage that occurred this past April. Linked is a nice summary by The Register.

Thanks to reader Michael Fischer for a tip on the report.

Chris Mellor — The Register

Outages

Updated: September 30, 2018 — 8:32 pm
A production of Tinker Tinker Tinker, LLC