Welcome to the many new subscribers who joined in the past week. I’m not sure who I have to thank for the sudden surge, but whoever you are, thanks!
Articles
What can the fire service learn about safety from the aviation industry? A 29-year veteran in the fire service answers that question in detail. We could in turn apply all of those lessons to operating complex infrastructures.
I’m surprised that I haven’t come across the term “High Reliability Organization” before reading the previous article. Here’s Wikipedia’s article on HROs.
A high reliability organization (HRO) is an organization that has succeeded in avoiding catastrophes in an environment where normal accidents can be expected due to risk factors and complexity.
Etsy instruments their deployment system to strike a vertical line on their graphite graphs for every deploy. This helps them quickly figure out whether a deploy happened right before a key metric took a turn for the worse.
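As a rough illustration of the idea (not Etsy's actual tooling), a deploy script could emit one datapoint per deploy using Graphite's plaintext protocol, where each line is `<metric path> <value> <timestamp>`. The metric name `deploys.production` and the host `graphite.example.com` below are placeholders I made up for the sketch.

```python
import socket
import time


def format_deploy_event(metric_path, timestamp=None):
    # One line in Graphite's plaintext protocol: "<path> <value> <timestamp>"
    # terminated by a newline. The value 1 simply marks "a deploy happened".
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} 1 {int(timestamp)}\n"


def send_deploy_event(metric_path, host="graphite.example.com", port=2003):
    # Hypothetical host; 2003 is Carbon's usual plaintext listener port,
    # but check your own setup before relying on it.
    line = format_deploy_event(metric_path)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line


if __name__ == "__main__":
    # Print the line that would be sent, without touching the network.
    print(format_deploy_event("deploys.production"), end="")
```

On the graph side, wrapping the series in Graphite's `drawAsInfinite()` function, for example `drawAsInfinite(deploys.production)`, should render each nonzero datapoint as a vertical line that can be overlaid on the metric you care about.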
A really interesting dive into the world of subsea network cables and the impact that cuts can have on businesses worldwide.
How closely can you really mimic production in your testing environments? In a way we’re all testing in production, and this article talks about getting that fact out in the open.
I wrote this article on my terrible little blog back in 2008 — forgive the horrid theme and apparently broken unicode support. This was well before I worked in Linden Lab’s Ops team, back when I was making a living as a user selling content in Second Life. What’s interesting to me in reading this article 8 years later is the user perspective on the impact of the string of bad outages, and especially Linden’s poor communication during outages.
More on the impact of Delta Air Lines’ major outage last month.
Most often a catastrophic failure is not due to a lack of standards, but a breakdown or circumvention of established procedures that compounded into a disastrous outcome. Multilayer complex systems outages signify management failure to drive change and improvement.
Outages
- Salesforce
  Salesforce.com was down or impaired for several hours.
  Full disclosure: Salesforce is the parent company of my employer, Heroku.
- DynamoDB
- Telstra Mail
- Google Cloud Platform
  Normally I don’t include single-zone failures in EC2 or GCP, but this one has an extremely interesting and detailed postmortem.
- EA (FIFA 16 and Battlefield 1 Beta)
- Vodafone (Ireland)
- Interpublic Group (Hollywood PR agency)
- Vesk
  The Register noted that Vesk bragged about their 100% uptime even after the outage, including for all of 2016. From Vesk’s recently-changed about page:
  “VESK has hit 100% uptime for all 2012, 2013, 2014, 2015 and 2016.”
- PlayStation Network
- PagerDuty
  PagerDuty is currently unable to process some inbound events. We are investigating the cause.
- Telkom (South Africa telecom)
  The company cited suspected sabotage and offered a monetary reward.
- Washington, DC 911 system
  Emergency services were knocked out for 90 minutes after a contract worker mistakenly hit the emergency shut-off button. The phrase “human error” is being tossed about.