SRE Weekly Issue #93

SPONSOR MESSAGE

All Day DevOps is on Oct. 24th! This FREE, online conference offers 100 DevOps-focused sessions across six different tracks. Learn more & register: http://bit.ly/2waBukw

Articles

Julia Evans tells us why she likes Kubernetes, and along the way explains how its resilient architecture works.

From the Jepsen folks, this outline is detailed enough to read by itself:

This outline accompanies a 12-16 hour overview class on distributed systems fundamentals. The course aims to introduce software engineers to the practical basics of distributed systems, through lecture and discussion. Participants will gain an intuitive understanding of key distributed systems terms, an overview of the algorithmic landscape, and explore production concerns.

In this article Steve Smith explains why a production environment is always in a state of near-failure, why optimising for robustness results in a brittle incident response process, and why Dual Value Streams are a common countermeasure to failure.

This article seems like a direct reply to last week’s “The Coming Software Apocalypse”. I gave that one a good review, so I feel compelled to include this refutation, but I was left really wishing for more detail on the arguments put forward. Perhaps there’s more to come?

Better requirements and better tools have already been tried and found wanting. Requirements are a trap. They don’t work. Requirements are no less complex and undiscoverable than code.

This is an article version of Cindy Sridharan’s Velocity 2017 talk. She covers a lot, including major monitoring methods, existing OSS tools, the pitfalls of each, and how to achieve observability in a cloud-based infrastructure.

GitHub ensures low MySQL replication lag by rate-limiting expensive batch-processing queries based on replica lag. Before freno, this logic resided in each client, with multiple implementations in different languages. Freno (which is open source) centralizes the replica lag polling and query rate-limiting decisions into a queryable service.
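To make the idea concrete, here is a minimal sketch of lag-based throttling, assuming a single decision point that batch jobs consult before each chunk of work. It is not freno's actual API or implementation: the names `LagThrottler`, `run_batch`, `read_lag`, and the threshold and retry values are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class LagThrottler:
    """Hypothetical central decision point: may a batch job write right now?"""

    max_lag_seconds: float                    # assumed threshold, not a freno default
    read_lag: Callable[[], Dict[str, float]]  # returns replica name -> lag in seconds

    def check(self) -> bool:
        lags = self.read_lag()
        # With no lag data we cannot prove the replicas are healthy, so refuse.
        if not lags:
            return False
        # The slowest replica decides: one lagging replica pauses the batch.
        return max(lags.values()) <= self.max_lag_seconds


def run_batch(
    chunks: Iterable[object],
    apply_chunk: Callable[[object], None],
    throttler: LagThrottler,
    sleep: Callable[[float], None],
    max_waits: int = 100,
) -> int:
    """Apply chunks one at a time, pausing while the throttler says no."""
    applied = 0
    for chunk in chunks:
        waits = 0
        while not throttler.check():
            waits += 1
            if waits > max_waits:
                raise TimeoutError("replicas stayed lagged; giving up")
            sleep(0.1)  # back off briefly before asking again
        apply_chunk(chunk)
        applied += 1
    return applied
```

The point of the design is that every client, in any language, asks the same question of the same place instead of each re-implementing lag polling. In this sketch you would pass `time.sleep` as `sleep` and a function that queries your replicas as `read_lag`.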

Earlier this year, LinkedIn open sourced their alerting system duo, Iris and Oncall. Together, these tools provide functionality similar to vendor solutions like PagerDuty and VictorOps.

Here’s a great guide to rate-limiting in NGINX including config snippets.

Netflix has an in-house serverless environment on which they run “nano-services”. It has nifty features including automatic pre-warming, gradual roll-out scheduling, and canary deployments.

GitHub details their Internet-facing network topology and explains how they use traffic engineering to ensure their connectivity is fast and reliable.

What if two people try to interact, but only one of them is flagged into a new feature? OKCupid tells us why A/B testing is much harder than it seems, and then they explain how they developed useful test cohorts.

A primer on runbooks, including a nice template you can use as a starting point in writing yours.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

Updated: October 15, 2017 — 9:19 pm
A production of Tinker Tinker Tinker, LLC