SRE Weekly Issue #93

Articles

Julia Evans tells us why she likes Kubernetes, and along the way explains how its resilient architecture works.

distsys-class/README.markdown at master · aphyr/distsys-class · GitHub

From the Jepsen folks, this outline is detailed enough to read by itself:

This outline accompanies a 12-16 hour overview class on distributed systems fundamentals. The course aims to introduce software engineers to the practical basics of distributed systems, through lecture and discussion. Participants will gain an intuitive understanding of key distributed systems terms, an overview of the algorithmic landscape, and explore production concerns.

When Optimising For Robustness Fails

In this article Steve Smith explains why a production environment is always in a state of near-failure, why optimising for robustness results in a brittle incident response process, and why Dual Value Streams are a common countermeasure to failure.

What will programming look like in the future?

This article seems like a direct reply to last week’s “The Coming Software Apocalypse”. I gave that one a good review, so I feel compelled to include this refutation, but I was left really wishing for more detail on the arguments put forward. Perhaps there’s more to come?

Better requirements and better tools have already been tried and found wanting. Requirements are a trap. They don’t work. Requirements are no less complex and undiscoverable than code.

Monitoring in the time of Cloud Native

This is an article version of Cindy Sridharan’s Velocity 2017 talk. She covers a lot, including major monitoring methods, existing OSS tools, the pitfalls of each, and how to achieve observability in a cloud-based infrastructure.

Mitigating replication lag and reducing read load with freno

GitHub ensures low MySQL replication lag by rate-limiting expensive batch-processing queries based on replica lag. Before freno, this logic resided in each client, with multiple implementations in different languages. Freno (which is open source) centralizes the replica lag polling and query rate-limiting decisions into a queryable service.
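To make the idea concrete, here is a minimal sketch of lag-based throttling with the decision kept in one place. It is not freno's actual API or implementation: `LagThrottler`, `run_batches`, the `read_lags` callback, and the 1-second threshold are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Iterable, List


class LagThrottler:
    """Hypothetical central throttle decision (not freno's real API).

    Writes are allowed only while the worst replica lag is at or below
    a threshold.
    """

    def __init__(self, read_lags: Callable[[], Iterable[float]],
                 max_lag_seconds: float = 1.0) -> None:
        # read_lags is an assumed callback returning the current lag,
        # in seconds, of each replica being watched.
        self.read_lags = read_lags
        self.max_lag_seconds = max_lag_seconds

    def check(self) -> bool:
        lags: List[float] = list(self.read_lags())
        if not lags:
            # No lag readings available: refuse, rather than assume health.
            return False
        return max(lags) <= self.max_lag_seconds


def run_batches(batches: Iterable, apply_batch: Callable,
                throttler: LagThrottler,
                sleep: Callable[[float], None] = time.sleep,
                backoff_seconds: float = 0.1) -> int:
    """Apply each batch, pausing while the throttler reports too much lag."""
    applied = 0
    for batch in batches:
        while not throttler.check():
            sleep(backoff_seconds)  # back off and ask again
        apply_batch(batch)
        applied += 1
    return applied
```

Each batch job only asks "may I write now?" and backs off when the answer is no, so the lag-polling logic and the threshold live in a single component instead of being reimplemented in every client.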

Open Sourcing Iris and Oncall

Earlier this year, LinkedIn open sourced their alerting system duo. Together, these tools provide functionality similar to vendor solutions like PagerDuty and VictorOps.

NGINX Rate Limiting

Here’s a great guide to rate-limiting in NGINX, including config snippets.

Developer Experience Lessons Operating a Serverless-like Platform At Netflix

Netflix has an in-house serverless environment on which they run “nano-services”. It has nifty features including automatic pre-warming, gradual roll-out scheduling, and canary deployments.

Transit and Peering: How your requests reach GitHub

GitHub details their Internet-facing network topology and explains how they use traffic engineering to ensure their connectivity is fast and reliable.

The pitfalls of A/B testing in social networks

What if two people try to interact, but only one of them is flagged into a new feature? OKCupid tells us why A/B testing is much harder than it seems, then explains how they developed useful test cohorts.

Focus on Remediation: Leverage Runbooks to Reduce MTTR

A primer on runbooks, including a nice template you can use as a starting point in writing yours.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.
