The actual incident was fairly interesting too. The crux of it comes down to this quote that we’ve probably all uttered ourselves at one point or another:
The further distant staging is from production, the more likely we are to introduce a bug.
The only procedure that is worth a damn is one that has been successfully followed in the heat of battle.
So when legacy vendors suggest that the Salesforce outage calls cloud into question, they tend to ignore the fact that their own systems suffer regular outages. They just rely on the fact that few people know about them.
Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.
[…] a software application designer might want to know that her code will never crash or that it will never execute certain functions without an authorized login. These are simple, practical questions – but answering them is computationally challenging because of the enormous number of possible ways code may be executed, […]
A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole.
NS1’s CEO posted this incredibly detailed and excellent postmortem on the sophisticated DDoS attacks they suffered.
- Pirate Bay
- Virginia (US state) government network
- Walmart MoneyCard
Telstra has had a hell of a time this year. This week, social media and the news were on fire over a days-long Telstra outage. This time, they’re offering customers a $25 credit instead of a free data day. Click through for Telstra’s explanation of what went wrong.
Linked is their post-incident analysis.
- Kimbia (May 3)
A couple of weeks ago, Kimbia, a company that helps non-profits raise funds, suffered a massive failure. It happened during Give Local America, a huge fundraising day for thousands of non-profits in the US, and as a result many organizations had a hard time accepting donations.