SRE Weekly Issue #4

Articles

A nifty-looking packet generator with packets crafted by Lua scripts. If this thing lives up to the hype in its documentation, it’d be pretty awesome! Thanks to Chris Maynard for the link and for the sleepless days and nights we spent mucking with trafgen’s source.

Just as we design systems to be monitored, this article suggests that we should design systems to be audited. Doing the work up front and incrementally rather than as an afterthought can take the pain out of auditing.

A nice intro to structured logging. I’m a big fan of ELK, and especially using Logstash to alert on events that might be difficult to catch otherwise.
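For anyone who hasn't tried structured logging yet, here's a minimal sketch of the idea using only Python's standard library. It isn't taken from the linked article: the `JsonFormatter` class, the `make_logger` and `demo` helpers, the `checkout` logger name, and the `extra={"fields": ...}` convention are all my own assumptions for illustration. The point is just that each event becomes one JSON object per line, which a pipeline like Logstash can parse without fragile regexes.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}
        # and those key/value pairs are merged into the event.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream, name="checkout"):
    """Build a logger that writes JSON events to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo():
    """Log one event into a buffer and return it parsed back from JSON."""
    buf = io.StringIO()
    log = make_logger(buf)
    log.warning("payment retry", extra={"fields": {"order_id": 1234, "attempt": 3}})
    return json.loads(buf.getvalue())
```

With fields like `order_id` and `attempt` as real keys rather than text buried in a message, an alert on something like "more than N retries for one order" becomes a simple filter downstream.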

I looked at a few “lessons learned from Black Friday 2015” articles, but they’re all light on technical detail. My consolation prize is this article that seems eerily appropriate, given Target’s outage on Cyber Monday.

The strategy of turning away only some requesters to avoid a full site outage is interesting, but I could see it causing a thundering herd problem if not done carefully, where folks just repeatedly hit reload and cause more traffic.
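To make the "done carefully" part concrete, here's a rough sketch of one way partial load shedding might look. This is not how Target or the linked article does it; the function names, the capacity model, and the 30-second defaults are assumptions I made up for illustration. The idea is to admit a shrinking fraction of requests once load exceeds capacity, and to hand rejected clients a randomized wait (for example in an HTTP `Retry-After` header) so they don't all come back at the same moment.

```python
import random


def should_admit(current_load, capacity, rng=random.random):
    """Admit everything under capacity; above it, admit a shrinking fraction.

    Assumed model: load and capacity are in the same unit, such as
    requests per second.
    """
    if current_load <= capacity:
        return True
    admit_probability = capacity / current_load
    return rng() < admit_probability


def retry_after_seconds(base=30, spread=30, rng=random.random):
    """Return a jittered wait so rejected clients do not return in lockstep."""
    return base + int(rng() * spread)
```

At twice the capacity, roughly half of the requests get through, and the turned-away half are told to wait somewhere between 30 and 59 seconds. That only helps against clients that honor the hint, of course; a human mashing reload ignores it.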

These “predictions” (suggestions, really) about load testing may be review to some, but this article caught my interest because it was the first time I’d heard the term Performance Engineering. Definitely a field worth paying attention to as it becomes more prevalent due to its overlap with SRE.

Modern medicine has been working through very similar issues to SRE, related to controlling the impact of human error through process design and analysis of human factors. We stand to learn a lot from articles such as this one. For example, they’ve been doing the “blameless retrospective” for a long time:

As the attitude to adverse events has changed from the defensive “blame and shame culture” to an open and transparent healthcare delivery system, it is timely to examine the nature of human errors and their impact on the quality of surgical health care.

A speedy and detailed postmortem from Valve on the Steam issue on Christmas.

Outages

This issue covers Christmas and New Year’s, and we have quite a list of outages. Notably lacking from this list is Xbox Live, despite threats reported in the last issue.

Updated: January 2, 2016 — 8:13 pm
A production of Tinker Tinker Tinker, LLC