Articles
Last month, I linked to an article on Xero’s incident response process, and I said:
I find it interesting that incident response starts off with someone filling out a form.
This article goes into detail on how the form works, why they have it, and the actual questions on the form! Then they go on to explain their “on-call configuration as code” setup, which is really nifty. I can’t wait to see part II and beyond.
Spokes is GitHub’s system for storing distributed replicas of git repositories. This article explains how they can do this over long distances in a reasonable amount of time (and why that’s hard). I especially love the “Spokes checksum” concept.
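To get a feel for what a replica checksum buys you, here’s a toy sketch — my own illustration, not GitHub’s actual Spokes checksum — of the general idea: fingerprint a repository by hashing its refs, so two replicas agree exactly when their fingerprints match.

```python
# Toy illustration of a replica checksum (NOT GitHub's actual Spokes checksum):
# fingerprint a repository by hashing its refs, so replicas can cheaply
# compare fingerprints to detect divergence.
import hashlib
import subprocess

def repo_checksum(repo_path: str) -> str:
    # List every ref and the commit it points at, in a stable order.
    out = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--sort=refname", "--format=%(refname) %(objectname)"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Hash the whole listing; identical refs => identical checksum.
    return hashlib.sha256(out.encode()).hexdigest()

# Two replicas agree iff their checksums match:
# repo_checksum("/replica-a/repo.git") == repo_checksum("/replica-b/repo.git")
```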
From the CEO of NS1, a piece on the value of checklists in incident response.
Here’s another great guide on the hows and whys of secondary DNS, including options on dealing with nonstandard record types that aren’t compatible with AXFR.
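To make the mechanics concrete, here’s a minimal sketch of a secondary pulling a zone via AXFR using the dnspython library (the primary’s address and the zone name below are placeholders). Nonstandard, provider-specific record types won’t come across in a transfer like this, which is the problem the guide digs into.

```python
# Minimal sketch of a secondary pulling a zone via AXFR with dnspython.
# "203.0.113.10" and "example.com" are placeholder values.
import dns.query
import dns.zone

def pull_zone(primary_ip: str, zone_name: str) -> dns.zone.Zone:
    # Request a full zone transfer (AXFR) from the primary and
    # build an in-memory copy of the zone from the answer stream.
    xfr = dns.query.xfr(primary_ip, zone_name)
    return dns.zone.from_xfr(xfr)

zone = pull_zone("203.0.113.10", "example.com")
for name, node in zone.nodes.items():
    print(name, node.to_text(name))
```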
From a customer’s perspective, “planned downtime” and “outage” often mean the same thing.
“serverless” != “NoOps”
Willis stresses integrating serverless into existing operations processes rather than replacing them. “Serverless is just another form of compute. … All the core principles that we’ve really learned about high-performance organizations apply differently … but the principles stay the same,” he said.
When we use root cause analysis, says Michael Nygard, we narrow our focus onto counterfactuals that get in the way of finding out what really happened.
CW: hypothetical violent imagery
Outages
This week had a weirdly large number of outages!
- Heroku
- Heroku posted a public followup for incident #1334, with a pretty interesting cause: at the end of the month, load on an internal API spiked because the number of apps that had exhausted their monthly free quota hit a peak.
Full disclosure: Heroku is my employer.
- How a Tiny Error Shut Off the Internet for Parts of the US
- I normally don’t include ISP failures, but this one was widespread across the US and had an interesting cause. Level 3 accidentally created a route leak that broke traffic for many Comcast customers (including me).
- Google App Engine Memcache Service
- Linked is Google’s followup analysis, which suggests that the outage was due to a scaling issue in a configuration database.
- OVH to Disassemble Container Data Centers after Epic Outage in Europe
- Snapchat
- E-Trade
- Grindr
- Netflix
- Yahoo Mail