My favorite read this week was this first article. It’s long, but it’s well worth a full read.
Articles
Thanks to logikal on hangops #incident_response for this one.
For example, if the system makes it time consuming and difficult to complete safety steps, it is more likely that staff will skip these steps in an effort to meet productivity goals.
I believe that mistakes during incident response in my job don’t lead directly to deaths now, but how soon before they do? And are my errors perhaps causing deaths indirectly even now? (Hat-tip to Courtney E. for that line of thinking.)
Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.
Outages
- NBA 2K16
- Westpac (AU bank)
- iiNet (AU ISP)
- Iraq
  - Iraq purportedly shut down its internet access (removed its BGP announcements) to prevent students from cheating on exams.
- Virgin Mobile
  - They offered users a data credit immediately.
- Telstra
  - Telstra had a long outage this week. They claim that the outage was caused by vandalism in Taree.
- Datadog
  - Thanks to acabrera on hangops #incident_response for this one.
- Mailgun
- Disney Ticketing
  - Disney’s ticketing site suffered under an onslaught of traffic this week brought on by their free dining deal program. Reference: we had a heck of a time making our dining reservations.