Articles
Last month, I linked to an article on Xero’s incident response process, and I said:
I find it interesting that incident response starts off with someone filling out a form.
This article goes into detail on how the form works, why they have it, and the actual questions on the form! Then they go on to explain their “on-call configuration as code” setup, which is really nifty. I can’t wait to see part II and beyond.
Spokes is GitHub’s system for storing distributed replicas of git repositories. This article explains how they can do this over long distances in a reasonable amount of time (and why that’s hard). I especially love the “Spokes checksum” concept.
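To get a feel for what a replica checksum buys you, here’s a toy sketch — my own illustration, not GitHub’s actual Spokes checksum — of the general idea: fingerprint a repository by hashing its refs, so two replicas agree exactly when their fingerprints match.

```python
# Toy illustration of a replica checksum (NOT GitHub's actual Spokes checksum):
# fingerprint a repository by hashing its refs, so replicas can cheaply
# compare fingerprints to detect divergence.
import hashlib
import subprocess

def repo_checksum(repo_path: str) -> str:
    # List every ref and the commit it points at, in a stable order.
    out = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--sort=refname", "--format=%(refname) %(objectname)"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Hash the whole listing; identical refs => identical checksum.
    return hashlib.sha256(out.encode()).hexdigest()

# Two replicas agree iff their checksums match:
# repo_checksum("/replica-a/repo.git") == repo_checksum("/replica-b/repo.git")
```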
From the CEO of NS1, a piece on the value of checklists in incident response.
Here’s another great guide on the hows and whys of secondary DNS, including options on dealing with nonstandard record types that aren’t compatible with AXFR.
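To make the mechanics concrete, here’s a minimal sketch of a secondary pulling a zone via AXFR using the dnspython library (the primary’s address and the zone name below are placeholders). Nonstandard, provider-specific record types won’t come across in a transfer like this, which is the problem the guide digs into.

```python
# Minimal sketch of a secondary pulling a zone via AXFR with dnspython.
# "203.0.113.10" and "example.com" are placeholder values.
import dns.query
import dns.zone

def pull_zone(primary_ip: str, zone_name: str) -> dns.zone.Zone:
    # Request a full zone transfer (AXFR) from the primary and
    # build an in-memory copy of the zone from the answer stream.
    xfr = dns.query.xfr(primary_ip, zone_name)
    return dns.zone.from_xfr(xfr)

zone = pull_zone("203.0.113.10", "example.com")
for name, node in zone.nodes.items():
    print(name, node.to_text(name))
```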
From a customer’s perspective, “planned downtime” and “outage” often mean the same thing.
“serverless” != “NoOps”
Willis stresses integrating serverless into existing operations processes rather than replacing them. “Serverless is just another form of compute. … All the core principles that we’ve really learned about high-performance organizations apply differently … but the principles stay the same,” he said.
When we use root cause analysis, says Michael Nygard, we narrow our focus onto counterfactuals that get in the way of finding out what really happened.
CW: hypothetical violent imagery
Outages
This week had a weirdly large number of outages!
- Heroku
- Heroku posted a public followup for incident #1334, with a pretty interesting cause: at the end of the month, load on an internal API spiked because the number of apps that had exhausted their monthly free quota hit a peak.
Full disclosure: Heroku is my employer.
- How a Tiny Error Shut Off the Internet for Parts of the US
- I normally don’t include ISP failures, but this one was widespread across the US and had an interesting cause. Level 3 accidentally created a route leak that broke traffic for many Comcast customers (including me).
- Google App Engine Memcache Service
- Linked is Google’s followup analysis, which suggests that the outage was due to a scaling issue in a configuration database.
- OVH to Disassemble Container Data Centers after Epic Outage in Europe
- Snapchat
- E-Trade
- Grindr
- Netflix
- Yahoo Mail