SRE Weekly Issue #92

View on sreweekly.com

Shout-out to all the folks I met at Velocity! It was an exhilarating week filled with awesome personal conversations and some really incredible talks.

Then I came back to Earth to discover that everyone chose this week to write awesome SRE-related articles. I’m still working my way through them, but get ready for a great issue.

Articles

The Stella Report

This is the blockbuster PDF dropped by the SNAFUcatchers during their keynote on day two of Velocity. Even just the 15-minute summary by Richard Cook and David Woods had me on the edge of my seat. In this report, they summarize the lessons gleaned from presentations of “SNAFUs” by several companies during winter storm Stella.

SNAFUs are anomalous situations that would have turned into outages were it not for the actions taken by incident responders. Woods et al. introduced a couple of concepts that are new to me: “dark debt” and “blameless versus sanctionless”. I love these ideas and can’t wait to read more.

IT incident response ditches root cause analysis process

Chaos engineering unearths IT deployments’ dark debt

These two articles provide a pretty good round-up of the ideas shared at Velocity this past week.

The Coming Software Apocalypse

This one starts with a 6-hour 911 (emergency services) outage in 2014 and the Toyota unintended acceleration incidents, and then vaults off into really awesome territory. Research is being done into new paradigms of software development that leave the programming to computers, focusing instead on describing behavior using a declarative language. The goal: provably correct systems. Long read, but well worth it.

The Value of Optimizing for Resilience

Drawing from Woods, Allspaw, Snowden, and others, this article explains how and why to improve the resilience of a system. There’s a great hypothetical example of graceful degradation that really clarified it for me.

Nines don’t matter T-Shirt

In a recent talk, Charity Majors made waves by saying, “Nines don’t matter when users aren’t happy.” Look, you can have that in t-shirt and mug format!

Beta Testing in Production Like a Pro

A summary of how six big-name companies test new functionality by gradually rolling it out in production.

How New, Resilient Networks Change Data Center Design

This article jumps off from Azure’s announcement of availability zones to discuss a growing trend in datacenters. We’re moving away from highly reliable “tier 4” datacenters and pushing more of the responsibility for reliability to software and networks.

Ever wanted to know how Xero does incident management?

Of course I do, and I don’t even know who Xero is! They use chat, chatops, and Incident Command, like a lot of other shops. I find it interesting that incident response starts off with someone filling out a form.

Outages

PagerDuty
- PagerDuty posted a lengthy followup report on their outage on September 19-21. TL;DR: Cassandra. It was the worst kind of incident, in which they had to spin up an entirely new cluster and develop, test, and enact a novel cut-over procedure. Ouch.
Heroku
- Heroku suffered a few significant outages. The one linked above includes a followup that describes a memory leak in their request routing layer. These two don’t yet have followups: #1298, #1301
  Full disclosure: Heroku is my employer.
Azure
- On September 29, Azure suffered a 7-hour outage in Northern Europe. They’ve released a preliminary followup that describes an accidental release of fire suppression agent and the resulting carnage. Microsoft promises more detail by October 13.
  Unfortunately can’t deep-link to this followup, so just scroll down to 9/29.
New Relic
Blackboard (education web platform)

SRE Weekly Issue #92

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues