SRE Weekly Issue #25

Articles

Supermarket Berkshelf Incident Post Mortem

This blows my mind. Chef held a live, public retrospective meeting for a recent production incident. I love this idea and I can only hope that more companies follow suit. The transparency is great, but even better is that they shared their retrospective process itself. They have a well-defined format for retrospectives, including a statement of blamelessness at the beginning. Kudos to Chef for this, and thanks to Nell Shamrell-Harrington for posting the link on Hangops.

The actual incident was fairly interesting too. The crux of it comes down to this quote that we’ve probably all uttered ourselves at one point or another:

The further distant staging is from production, the more likely we are to introduce a bug.

8 Ways to Reduce Alert Fatigue

PagerDuty has this explanation of alert fatigue and some tips on preventing it. One thing they missed in their list of impacts of alert fatigue: employee attrition, which directly impacts reliability.

How to use Anycast to provide high availability to a RADIUS server

For the network-heads out there, here’s an article on how to set up Anycast routing.

FCC Approves Increased Network Outage Reporting

As we become more dependent on our mobile phones, the FCC is gathering information on provider outages. I, for one, wouldn’t be able to call 911 (emergency services) if AT&T had an outage, because I don’t have a land line.

No Procedure Survives First Contact With a Production Outage

I love this article if only for its title. It’s short, but its thesis bears considering: all the procedure documentation in the world won’t help you if you can’t find it during an incident, or if it can’t practically be followed.

The only procedure that is worth a damn is one that has been successfully followed in the heat of battle.

Here comes the FUD! Legacy vendors sure to jump on the Salesforce outage

So when legacy vendors suggest that the Salesforce outage calls cloud into question, they tend to ignore the fact that their own systems suffer regular outages. They just rely on the fact that few people know about them.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

How Formal Verification Can Thwart Change-Induced Network Outages and Breaches

An introduction to the application of formal mathematical verification to network configurations. A good overview, but I wish it went into more practical detail.

[…] a software application designer might want to know that her code will never crash or that it will never execute certain functions without an authorized login. These are simple, practical questions – but answering them is computationally challenging because of the enormous number of possible ways code may be executed, […]

IFTTT Blog – Update: Keeping Pinboard on IFTTT

Earlier this year, I featured a story about Pinboard.in and IFTTT. IFTTT released this official apology and explanation of the problems Pinboard.in’s author outlined, and they (unofficially) promised to retain support through the end of 2016. Pinboard.in is an integral part of how I produce SRE Weekly every week, so I’m glad to see that this turned out for the best.

Fault Tolerance on the Cheap

This article is more on the theoretical side than practical, and it’s a really interesting read. It’s the second in a series, but I recommend reading both at once (or skipping the first).

A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole.

Outages

Twitter
NS1
- NS1’s CEO posted this incredibly detailed and excellent postmortem on the sophisticated DDoS attacks they suffered.
Pirate Bay
WhatsApp
Virginia (US state) government network
Walmart MoneyCard
Telstra
- Telstra has had a hell of a time this year. This week social media and news were on fire with this days-long Telstra outage. This time, they’re offering customers a $25 credit instead of a free data day. Click through for Telstra’s explanation of what went wrong.
GitLab
- Linked is their post-incident analysis.
Kimbia (May 3)
- A couple weeks ago, Kimbia, a company that helps non-profits raise funds, suffered a massive failure. This occurred during Give Local America, a huge fundraising day for thousands of non-profits in the US, with the result that many organizations had a hard time accepting donations.
