SRE Weekly Issue #5

Articles

What does owning your availability really mean? Brave New Geek argues that it simply means owning your design decisions. I love this quote:

An SLA is not an insurance policy or a hedge against the business impact of an outage, it’s merely a refund policy.

ISIS hackers claim responsibility for BBC outage during “test attack”

Apparently last week’s BBC outage was “just a test”. Now we have to defend our networks against misdirected hacktivism?

Operations is More Than Just Systems Administration

Increased deployment automation leads to the suggestion that developers can now “do ops” (see also: “NoOps”). This author explains why operations is much more than deployment.

Full disclosure: Heroku, my employer, is briefly mentioned.

Oyster’s Underground Nightmare: When DevOps Kills Retail – DZone DevOps

Tips on how to move toward rapid releases without drastically increasing your risk of outages. They cite the Knight Capital automated trading mishap as a cautionary example, along with Starbucks and this week’s Oyster outage.

Holistic Configuration Management at Facebook | the morning paper

Facebook uses configuration for many facets of its service, and they embrace “configuration as code”. They make extensive use of automated testing and canary deployments to keep things safe.

Thousands of changes made by thousands of people is a recipe for configuration errors – a major source of site outages.

Quick Tips: How to Post Mortem Every Incident

PagerDuty shares a few ideas about how and why to do retrospective analysis of incidents.

Crossroads of Asynchrony and Graceful Degradation

Another talk from QCon. Netflix’s Nitesh Kant explains how an asynchronous microservice architecture naturally supports graceful degradation. (thanks to DevOps Weekly for the link)
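To make the idea concrete, here is a minimal sketch of graceful degradation in an asynchronous service. It is not taken from the talk or from Netflix's code: the function names, the fallback list, and the timeout value are all hypothetical, and Python's asyncio stands in for whatever stack you actually run. A slow dependency is given a deadline, and the caller falls back to a generic response rather than failing the whole request.

```python
import asyncio

# Hypothetical static fallback shown when personalization is unavailable.
DEFAULT_RECOMMENDATIONS = ["popular-1", "popular-2"]


async def fetch_recommendations(user_id, delay):
    """Stand-in for a call to a downstream service (hypothetical)."""
    await asyncio.sleep(delay)  # simulates network and processing latency
    return ["personalized-for-" + user_id]


async def recommendations_or_fallback(user_id, delay, timeout=0.05):
    """Wait for the dependency up to `timeout` seconds, then degrade."""
    try:
        return await asyncio.wait_for(
            fetch_recommendations(user_id, delay), timeout
        )
    except asyncio.TimeoutError:
        # The dependency is too slow: serve a generic answer instead of an error.
        return DEFAULT_RECOMMENDATIONS


async def render_page(user_id, slow):
    """Build a page; the recommendations section may be degraded."""
    delay = 1.0 if slow else 0.0
    recs = await recommendations_or_fallback(user_id, delay)
    return {"user": user_id, "recommendations": recs}
```

Calling `asyncio.run(render_page("u1", slow=True))` returns the page with the default list, while `slow=False` returns the personalized one. Because the call is non-blocking, the deadline costs the caller nothing beyond the wait itself.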

The Network is Reliable

The title is one of the fallacies of distributed computing. This ACM Queue article is an informal survey of the many fascinating ways that networks fail.

Outages

Nintendo Network Down After Service Suffers Unexpected Outage
HSBC Online Banking
Vodafone
- Flooding in their UK datacenter.
Easyspace
Oyster (London transit system)
Verizon Wireless
Time Warner Cable
Sony PlayStation Network
- Sony has said they will compensate users by extending subscriptions.
