SRE Weekly Issue #14

Articles

A classic from John Allspaw. Designing a resilient system isn’t about eliminating individual causes of downtime; it’s about continuing to operate in spite of them. Allspaw is a big proponent of looking beyond human error to the system surrounding the error.

…human error as a root cause isn’t where you should end, it’s where you should start your investigation.

This could just as well be titled, 7 Rules for Performing Effective Retrospectives. There’s some really great stuff in here and also some good references.

The rules are:

  • Learn, don’t blame
  • Know the scope of the system
  • Make sure you have all the relevant logs
  • Make sure the logs line up with the timeline
  • Separate the noise from the information
  • Make sure the biases are known
  • Make sure you deal in facts and not counterfacts

Spotify shares this deeply technical look at their event delivery and processing system that handles 700k messages per second. The bulk of the article details how they tested Google’s Cloud Pub/Sub to be sure it was reliable enough for their needs.

…Cloud Pub/Sub was being advertised as beta software; we were unaware of any organisation other than Google who were using it at our scale.

Taobao.com suffered a huge security breach in which credentials harvested from previous break-ins were used to break into accounts. This short write-up on DZone urges us to use anomaly detection to catch brute-force attacks like this as they happen.

Reports say the hackers executed approximately 100 million login attempts, and almost 21 million of these turned out to be successful.
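To make the idea concrete, here is a minimal sketch of the simplest kind of detection that could flag a login flood while it is happening: counting attempts per source over a sliding time window. This is not the method from the DZone write-up, which doesn't go into implementation detail. The `LoginRateMonitor` class, its parameters, and the thresholds are all hypothetical, and a real system would need tuned baselines and would have to cope with attacks spread across many source addresses.

```python
from collections import defaultdict, deque


class LoginRateMonitor:
    """Hypothetical helper: flags sources with too many login attempts in a sliding window."""

    def __init__(self, window_seconds=60.0, max_attempts=20):
        # Both thresholds are illustrative assumptions, not recommended values.
        self.window_seconds = window_seconds
        self.max_attempts = max_attempts
        self._attempts = defaultdict(deque)  # source -> timestamps of recent attempts

    def record(self, source, timestamp):
        """Record one login attempt; return True if the source now exceeds the threshold."""
        recent = self._attempts[source]
        recent.append(timestamp)
        # Drop attempts that have aged out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_attempts


# Eight attempts from one address, one second apart, with a limit of five per minute.
monitor = LoginRateMonitor(window_seconds=60.0, max_attempts=5)
flags = [monitor.record("203.0.113.7", t) for t in range(8)]

# A single attempt much later falls in a fresh window and is not flagged.
later_flag = monitor.record("203.0.113.7", 1000)
```

In this run the sixth and later attempts are flagged, and the attempt at `t=1000` is not, because the earlier timestamps have aged out of the window.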

Speaking of anomaly detection, this article highlights the problems with existing anomaly detection systems and describes what a successful system would look like. I’ve yet to see a generalized anomaly detection system with an acceptable false positive rate that did better than specific, targeted monitoring.

This survey, released last month, looks possibly interesting. I’m not 100% sure though, because their server is offline and I can’t retrieve it. Oh, the irony.

Outages

  • Xbox Live
  • Fox and ABC News
    • Two large news sites suffered brief outages on Super Tuesday, an important voting day in the US. Both were apparently taken out by a failure in the analytics provider they share.

  • DirecTV
  • PSN
  • Netflix
  • The Pirate Bay
  • EE webmail
  • Amazon.com
  • CenturyLink
    • Miscommunication is cited in this construction-induced fiber cut.

  • The Division (game)
  • The KKK
    • Staminus, a DDoS protection company, suffered a huge data breach that exposed full names and credit card numbers. The attackers also took down their infrastructure, causing an outage for big-name clients such as the KKK.

Updated: March 13, 2016 — 1:01 pm
A production of Tinker Tinker Tinker, LLC