Articles
An outline of the design of Netflix’s new load balancer, with special emphasis on dealing with faltering backends. Great idea: servers report their utilization level in a response header. Tricky pitfall: the LB is so good at moving requests off of ailing backends that backend failure rate alerts don’t fire.
Mike Smith — Netflix
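To make the utilization-header idea concrete, here is a minimal sketch, not Netflix's actual implementation. It assumes a hypothetical header name (`X-Server-Utilization`) carrying a value between 0.0 and 1.0, and it uses a simple pick-two-and-compare selection rule as a stand-in for whatever the real load balancer does. `Backend` and `pick_backend` are illustrative names.

```python
import random

# Hypothetical header name; the article does not specify the real one here.
UTILIZATION_HEADER = "X-Server-Utilization"


class Backend:
    """Tracks the most recent utilization a backend reported about itself."""

    def __init__(self, name):
        self.name = name
        self.utilization = 0.0  # assumed scale: 0.0 (idle) to 1.0 (saturated)

    def record_response(self, headers):
        """Update utilization from a response's headers, ignoring bad values."""
        value = headers.get(UTILIZATION_HEADER)
        if value is None:
            return
        try:
            # Clamp to the assumed 0.0..1.0 range.
            self.utilization = min(1.0, max(0.0, float(value)))
        except ValueError:
            pass  # keep the previous reading if the header is malformed


def pick_backend(backends, rng=random):
    """Sample two distinct backends and return the less utilized one."""
    a, b = rng.sample(backends, 2)
    return a if a.utilization <= b.utilization else b
```

A balancer like this steers traffic away from a struggling backend quickly, which is exactly why the pitfall mentioned above arises: alerts keyed only on backend failure rate may stay quiet, so it may be worth alerting on the reported utilization as well.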
This article begins by explaining consistency versus availability in distributed data stores and argues that the trade-off is less significant than one might think. Then it describes a pitfall seen in some new data stores. I’ve wondered aloud here before how Spanner can make the claims it does, and this article explains that nicely.
Daniel Abadi
And here’s a refutation of part of the previous article by the creator of RavenDB.
Ayende Rahien
It is tempting to think that ensuring the resilience or continuity of all the individual parts of a business will guarantee the resilience or continuity of the whole.
Dr. Sandra Bell
GitHub used an innovative technique to avoid holding open a long-running code branch while upgrading their application to support Rails 5.2.
Eileen Uchitelle — GitHub
Worker node errors led to a cascading failure when Travis CI hit Google Compute Engine quotas.
Bogdana Vereha — Travis CI
This week, the US Internal Revenue Service (IRS) issued a report analyzing the tax-day outage that occurred this past April. Linked is a nice summary by The Register.
Thanks to reader Michael Fischer for a tip on the report.
Chris Mellor — The Register
Outages
- Amazon Alexa
- Delta Airlines
- Honeywell (smart thermostat manufacturer)
- Zoho
  - SaaS provider Zoho’s domain registration was revoked by its registrar after a run-of-the-mill phishing complaint, affecting 30 million users.
- Steemit