Landon McDowell, my (incredibly awesome) former boss at Linden Lab, wrote this article in 2014 detailing a spate of bad luck and outages they’d suffered. Causes included hardware failures, DDoS, and an integer DB column hitting its maximum value.
I worked on testing the new class of database hardware mentioned in the previous article. In order to be sure the new hardware could handle our specific query pattern, I captured and replayed production queries in real time using an open source tool written years earlier at Linden Lab called Apiary. This simple but powerful concept (capture and replay) was first introduced to me by one of Apiary’s co-authors, Charity Majors. I’ve since hacked a ton on Apiary and used it at two subsequent jobs.
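For anyone new to the idea, here is a minimal sketch of the replay half of capture and replay. It is not Apiary's code and makes no claim about how Apiary works internally: the `replay` function, the `(offset, query)` capture format, and the `speed` parameter are all hypothetical names chosen for illustration. The point it shows is that each query is issued at the same offset from the start as when it was captured, so the target sees a production-like arrival pattern rather than a flat-out flood.

```python
import time
from typing import Callable, Iterable, List, Tuple

# Hypothetical capture format: (seconds since capture start, query text).
Captured = Tuple[float, str]


def replay(
    captured: Iterable[Captured],
    execute: Callable[[str], object],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    speed: float = 1.0,
) -> List[float]:
    """Issue each query at its original offset; return per-query lag in seconds.

    `execute` is whatever sends a query to the system under test
    (an assumption here; in practice it would wrap a database client).
    `speed` > 1.0 compresses the timeline to replay faster than real time.
    """
    start = clock()
    lags: List[float] = []
    for offset, query in captured:
        target = start + offset / speed
        delay = target - clock()
        if delay > 0:
            sleep(delay)  # wait until the query's original send time
        # If the replayer is behind schedule, record how late this query is.
        lags.append(max(0.0, clock() - target))
        execute(query)
    return lags
```

The `clock` and `sleep` parameters are injectable so the timing logic can be exercised without waiting in real time. Growing lag values suggest the replayer (or the target, since this sketch executes queries one at a time) cannot keep up with the captured rate.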
A group calling themselves the Armada Collective has been making DDoS extortion threats to many companies recently. Cloudflare called them out as entirely toothless, with no actual attacks, but apparently some companies have paid anyway.
An excellent deep dive into a performance issue (which really equals a reliability issue), including some good lessons learned.
This is specifically referring to disaster scenarios such as hurricanes, but the general idea of a “resiliency cooperative” intrigues me.
A video and other startling revelations from the NTSB’s investigation of the fatal Yellow Line smoke incident
A review of the Fire and Emergency Services response found flaws in the actions and procedures taken by the incident commander who was the active fire chief at the time. The NTSB said the commander had no training on the incident management system that would have prepared him to better command the response.
Matthias Lafeldt goes deeper into chaos engineering in this latest installment of his series. He also introduces his Dockerized version of Netflix’s Chaos Monkey and shows how to automate chaos experiments to gain further confidence in your infrastructure’s reliability.
A great overview of the difficulties inherent in anomaly detection and alerting. Note that this article is written by OpsClarity and the end reads a bit like an ad for their service.
I’m not sure exactly what it is they’re offering now that they weren’t before, but this seems important. I think.
Telstra made a public commitment of $50 million to improve network resiliency, right about the time that they had a minor network outage. D’oh.
- NBA 2K16 (game)
- StatsCan (Canada’s Census)
Canadians attempting to complete their mandatory surveys met with website service interruption
- Telkom (South Africa telecom)
- Union Bank ATMs
- Etisalat (UAE ISP)
- Vox (South Africa ISP)
- MTN (South Africa telecom)
- Elastic Cloud
Elastic.co blogged a detailed post-incident analysis.