SRE Weekly Issue #17

I’m posting this week’s issue from the airport on my way to the west coast for business and SRECon16. I’m hoping to see some of you there! I’ll have a limited number of incredibly exclusive hand-made SRE Weekly patches to give out — just ask.


I love surveys! This one is about incident response as it applies to security and operations. The study author is looking to draw parallels between these two kinds of IR. I can’t wait for the results, and I’ll definitely link them here.

Charity Majors gives us this awesomely detailed article about a Terraform nightmare. An innocent TF run in staging led to a merry bug-hunt down the rabbit-hole and ended in wiping out production — thankfully on a not-yet-customer-facing service. She continues with an excellent HOWTO on fixing your Terraform config to avoid this kind of pitfall.

If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.
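The core of that advice is keeping each environment's state fully separate, so a plan or apply in staging physically cannot touch production. A minimal sketch of one common way to do this — an S3 backend with a per-environment state key (the bucket name and layout here are hypothetical, not from the article):

```hcl
# Hypothetical per-environment backend: staging and production each get
# their own state file, so a staging run can never mutate prod resources.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"    # assumed bucket name
    key    = "staging/terraform.tfstate"  # one key per environment
    region = "us-east-1"
  }
}
```

The production config would point at its own key (e.g. `production/terraform.tfstate`), and ideally a separate bucket and credentials, so isolation is enforced rather than relying on operator discipline.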

Uber set up an impressively rigorous test to determine which combination of serialization format and compression algorithm would hit the sweet spot between data size and compression speed. The article itself doesn’t directly touch on reliability, but of course running out of space in production is a deal-breaker, and I just love their methodology.
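Uber's actual harness covered many serialization formats and compression codecs; the core of the methodology — measure compressed size against compression time for each candidate — can be sketched with stdlib codecs and a JSON stand-in payload (the record shape here is invented for illustration):

```python
import json
import time
import zlib
import bz2
import lzma

# Invented stand-in payload: a batch of repetitive records, roughly the
# kind of structured data a trip-logging pipeline might serialize.
record = {"trip_id": 12345, "distance_km": 3.2, "city": "sf"}
payload = json.dumps([record] * 1000).encode()

codecs = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

for name, compress in codecs.items():
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(payload) / len(compressed)
    print(f"{name}: {ratio:.1f}x smaller in {elapsed_ms:.2f} ms")
```

In a real evaluation you would also vary the serialization format (Thrift, Avro, Protocol Buffers, and so on) and run each combination over representative production data, since the size/speed trade-off depends heavily on the data's shape.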

I make heavy use of Pinboard to automate my article curation for SRE Weekly. This week, IFTTT decided to axe support for Pinboard, and they did it in a kind of jerky way. The service’s owner Maciej wrote up a pretty hilarious and pointed explanation of the situation.

Thanks to Courtney for this one.

HipChat suffered another outage last week when they tried to push a remediation from a previous outage. Again with admirable speed, they’ve posted a detailed postmortem, including excellent lessons that we can all learn from.

This deployment was an important remediation from the previous outages and seemed like the right thing to do.
Lesson learned: No matter how much you want to remediate a problem for your users, consider the risk, complexity, and timing, and then reconsider again.

I love human error. Or rather, I love when an incident is reported as “human error”, because the story is inevitably more nuanced than that. Severe incidents are always the result of multiple things going wrong simultaneously. In this case, it was an operator mistake, insufficient radios and badges for responders, and lack of an established procedure for alerting utility customers.

A detailed exploration of latency and how it can impact online services, especially games.

Online gaming customers are twice as likely to abandon a game when they experience a network delay of 50 additional milliseconds

Say “eliminate downtime” and I’ll be instantly skeptical, but this article is a nice overview of predictive maintenance systems in datacenters.

Data centers use complex hardware that presents unforeseen problems that calendar-based maintenance checks simply cannot anticipate.


Updated: April 3, 2016 — 3:02 pm
A production of Tinker Tinker Tinker, LLC