SRE Weekly Issue #5


What does owning your availability really mean? Brave New Geek argues that it simply means owning your design decisions. I love this quote:

An SLA is not an insurance policy or a hedge against the business impact of an outage, it’s merely a refund policy.

Apparently last week’s BBC outage was “just a test”. Now we have to defend our networks against misdirected hacktivism?

Increased deployment automation leads to the suggestion that developers can now “do ops” (see also: “NoOps”). This author explains why operations is much more than deployment.

Full disclosure: Heroku, my employer, is briefly mentioned.

Tips on how to move toward rapid releases without drastically increasing your risk of outages. They cite the Knight Capital automated trading mishap as a cautionary example, along with Starbucks and this week’s Oyster outage.

Facebook uses configuration for many facets of its service, and they embrace “configuration as code”. They make extensive use of automated testing and canary deployments to keep things safe.

Thousands of changes made by thousands of people is a recipe for configuration errors – a major source of site outages.

PagerDuty shares a few ideas about how and why to do retrospective analysis of incidents.

Another talk from QCon. Netflix’s Nitesh Kant explains how an asynchronous microstructure architecture naturally supports graceful degradation. (thanks to DevOps Weekly for the link)

One of the fallacies of distributed computing. This ACM Queue article is an informal survey of all sorts of fascinating ways that networks fail.


Updated: January 10, 2016 — 8:18 am
A production of Tinker Tinker Tinker, LLC Frontier Theme