SRE Weekly Issue #22

Articles

Landon McDowell, my (incredibly awesome) former boss at Linden Lab, wrote this article in 2014 detailing a spate of bad luck and outages they’d suffered. Causes included hardware failures, DDoS, and an integer DB column hitting its maximum value.

Apiary — Multi-protocol load testing by replaying traffic

I worked on testing the new class of database hardware mentioned in the previous article. To be sure the new hardware could handle our specific query pattern, I captured production queries and replayed them in real time using Apiary, an open source tool written years earlier at Linden Lab. This simple but powerful concept (capture and replay) was first introduced to me by one of Apiary’s co-authors, Charity Majors. I’ve since hacked a ton on Apiary and used it at two subsequent jobs.
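For anyone who hasn't seen capture and replay before, here is a minimal Python sketch of the core idea: send each captured query at the same offset it originally arrived, optionally sped up. This is not Apiary's code or API. The capture format (offset in seconds plus query text), the `replay` function, and the `send` callback are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Iterable, Tuple

# Hypothetical capture format: (seconds since capture start, query text).
CapturedQuery = Tuple[float, str]


def replay(
    captured: Iterable[CapturedQuery],
    send: Callable[[str], None],
    speed: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send each query at its original offset, scaled by `speed`.

    `send` is whatever delivers a query to the system under test
    (an assumption here; a real tool would use a database client).
    Returns the number of queries sent.
    """
    start = clock()
    sent = 0
    for offset, query in captured:
        # Time remaining until this query is due, relative to replay start.
        delay = offset / speed - (clock() - start)
        if delay > 0:
            sleep(delay)
        send(query)
        sent += 1
    return sent
```

Scheduling against the replay start time, rather than sleeping for the gap between consecutive queries, keeps slow sends from accumulating drift. A `speed` above 1.0 compresses the timeline, which is one simple way to test headroom on new hardware.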

Empty DDoS Threats: Meet the Armada Collective

A group calling themselves the Armada Collective has been making DDoS extortion threats to many companies recently. Cloudflare called them out as entirely toothless, with no actual attacks, but apparently some companies have paid anyway.

Diagnosing performance degradation under adverse circumstances

An excellent deep dive into a performance issue (which really equals a reliability issue), including some good lessons learned.

U.S. Carriers Form “Resiliency Cooperative” to Handle Emergency Situations

This is specifically referring to disaster scenarios such as hurricanes, but the general idea of a “resiliency cooperative” intrigues me.

A video and other startling revelations from the NTSB’s investigation of the fatal Yellow Line smoke incident

A review of the Fire and Emergency Services response found flaws in the actions and procedures taken by the incident commander who was the active fire chief at the time. The NTSB said the commander had no training on the incident management system that would have prepared him to better command the response.

Chaos Monkey for Fun and Profit

Matthias Lafeldt goes deeper into chaos engineering in this latest installment of his series. He also introduces his Dockerized version of Netflix’s Chaos Monkey and shows how to automate chaos experiments to gain further confidence in your infrastructure’s reliability.

Drowning in Alerts: Blame it on Statistical Models for Anomaly Detection

A great overview of the difficulties inherent in anomaly detection and alerting. Note that this article is written by OpsClarity and the end reads a bit like an ad for their service.

Percona to Add Advanced High Availability to Enterprise and Premier Support Offerings

I’m not sure exactly what it is they’re offering now that they weren’t before, but this seems important. I think.

Outages

Telstra
- Telstra made a public commitment of $50 million to improve network resiliency, right about the time that they had a minor network outage. D’oh.
NBA 2K16 (game)
StatsCan (Canada’s Census)
- Canadians attempting to complete their mandatory surveys were met with a website service interruption.
Telkom (South Africa telecom)
Union Bank ATMs
Etisalat (UAE ISP)
Vox (South Africa ISP)
MTN (South Africa telecom)
Elastic Cloud
- Elastic.co blogged a detailed post-incident analysis.
