SRE Weekly Issue #25

Articles

This blows my mind. Chef held a live, public retrospective meeting for a recent production incident. I love this idea and I can only hope that more companies follow suit. The transparency is great, but even more valuable is that they shared their retrospective process itself. They have a well-defined format for retrospectives, including a statement of blamelessness at the beginning. Kudos to Chef for this, and thanks to Nell Shamrell-Harrington for posting the link on Hangops.

The actual incident was fairly interesting too. The crux of it comes down to this quote that we’ve probably all uttered ourselves at one point or another:

The further distant staging is from production, the more likely we are to introduce a bug.

PagerDuty has this explanation of alert fatigue and some tips on preventing it. One thing they missed in their list of impacts of alert fatigue: employee attrition, which directly impacts reliability.

For the network-heads out there, here’s an article on how to set up Anycast routing.

As we become more dependent on our mobile phones, the FCC is gathering information on provider outages. I, for one, wouldn’t be able to call 911 (emergency services) if AT&T had an outage, because I don’t have a land line.

I love this article if only for its title. It’s short, but its thesis bears considering: all the procedure documentation in the world won’t help you if you can’t find it during an incident, or if it can’t practically be followed.
The only procedure that is worth a damn is one that has been successfully followed in the heat of battle.

So when legacy vendors suggest that the Salesforce outage calls cloud into question, they tend to ignore the fact that their own systems suffer regular outages. They just rely on the fact that few people know about them.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

An introduction to the application of formal mathematical verification to network configurations. A good overview, but I wish it went into more practical detail.
[…] a software application designer might want to know that her code will never crash or that it will never execute certain functions without an authorized login. These are simple, practical questions – but answering them is computationally challenging because of the enormous number of possible ways code may be executed, […]

Earlier this year, I featured a story about Pinboard.in and IFTTT. IFTTT released this official apology and explanation of the problems Pinboard.in’s author outlined, and they (unofficially) promised to retain support through the end of 2016. Pinboard.in is an integral part of how I produce SRE Weekly every week, so I’m glad to see that this turned out for the best.

This article is more on the theoretical side than practical, and it’s a really interesting read. It’s the second in a series, but I recommend reading both at once (or skipping the first).
A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole.
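
As a rough illustration of that definition (a sketch, not code from the article): the wrapper below contains whatever a subcomponent throws and converts it into a known, degraded result, so the system as a whole still behaves in an anticipated way. The names `call_contained`, `flaky_recommendations`, and `render_page` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    value: Optional[str]
    degraded: bool  # True when the fallback was used


def call_contained(component: Callable[[], str], fallback: str) -> Result:
    """Run a subcomponent; turn any unanticipated failure into a known, degraded result."""
    try:
        return Result(value=component(), degraded=False)
    except Exception:
        # The failure stops here instead of bubbling out of the system.
        return Result(value=fallback, degraded=True)


def flaky_recommendations() -> str:
    # Hypothetical subcomponent that misbehaves in a way nobody planned for.
    raise RuntimeError("unexpected state in recommendation service")


def render_page() -> str:
    recs = call_contained(flaky_recommendations, fallback="(recommendations unavailable)")
    return f"Welcome back! {recs.value}"
```

Catching every exception is a deliberately blunt choice for the sketch; a real system would likely also log the failure and surface the `degraded` flag to monitoring.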

Outages

  • Twitter
  • NS1
    • NS1’s CEO posted this incredibly detailed and excellent postmortem on the sophisticated DDoS attacks they suffered.

  • Pirate Bay
  • WhatsApp
  • Virginia (US state) government network
  • Walmart MoneyCard
  • Telstra
    • Telstra has had a hell of a time this year. This week, social media and news were on fire with this days-long Telstra outage. This time, they’re offering customers a $25 credit instead of a free data day. Click through for Telstra’s explanation of what went wrong.

  • GitLab
    • Linked is their post-incident analysis.

  • Kimbia (May 3)
    • A couple of weeks ago, Kimbia, a company that helps non-profits raise funds, suffered a massive failure. This occurred during Give Local America, a huge fundraising day for thousands of non-profits in the US, and as a result many organizations had a hard time accepting donations.

Updated: May 29, 2016 — 9:50 pm