SRE Weekly Issue #17

I’m posting this week’s issue from the airport on my way to the west coast for business and SRECon16. I’m hoping to see some of you there! I’ll have a limited number of incredibly exclusive hand-made SRE Weekly patches to give out — just ask.


I love surveys! This one is about incident response as it applies to security and operations. The study author is looking to draw parallels between these two kinds of IR. I can’t wait for the results, and I’ll definitely link them here.

Charity Majors gives us this awesomely detailed article about a Terraform nightmare. An innocent TF run in staging led to a merry bug-hunt down the rabbit-hole and ended in wiping out production — thankfully on a not-yet-customer-facing service. She continues with an excellent HOWTO on fixing your Terraform config to avoid this kind of pitfall.

If you can’t safely test your changes in isolation away from prod, you don’t have infrastructure as code.
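The core of that advice is keeping each environment's state fully separate, so a plan or apply in staging physically cannot touch production. A minimal sketch of one common way to do this — an S3 backend with a per-environment state key (the bucket name and layout here are hypothetical, not from the article):

```hcl
# Hypothetical per-environment backend: staging and production each get
# their own state file, so a staging run can never mutate prod resources.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"    # assumed bucket name
    key    = "staging/terraform.tfstate"  # one key per environment
    region = "us-east-1"
  }
}
```

The production config would point at its own key (e.g. `production/terraform.tfstate`), and ideally a separate bucket and credentials, so isolation is enforced rather than relying on operator discipline.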

Uber set up an impressively rigorous test to determine which combination of serialization format and compression algorithm would hit the sweet spot between data size and compression speed. The article itself doesn’t directly touch on reliability, but of course running out of space in production is a deal-breaker, and I just love their methodology.
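Uber's actual harness covered many serialization formats and compression codecs; the core of the methodology — measure compressed size against compression time for each candidate — can be sketched with stdlib codecs and a JSON stand-in payload (the record shape here is invented for illustration):

```python
import json
import time
import zlib
import bz2
import lzma

# Invented stand-in payload: a batch of repetitive records, roughly the
# kind of structured data a trip-logging pipeline might serialize.
record = {"trip_id": 12345, "distance_km": 3.2, "city": "sf"}
payload = json.dumps([record] * 1000).encode()

codecs = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

for name, compress in codecs.items():
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(payload) / len(compressed)
    print(f"{name}: {ratio:.1f}x smaller in {elapsed_ms:.2f} ms")
```

In a real evaluation you would also vary the serialization format (Thrift, Avro, Protocol Buffers, and so on) and run each combination over representative production data, since the size/speed trade-off depends heavily on the data's shape.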

I make heavy use of Pinboard to automate my article curation for SRE Weekly. This week, IFTTT decided to axe support for Pinboard, and they did it in a kind of jerky way. The service’s owner Maciej wrote up a pretty hilarious and pointed explanation of the situation.

Thanks to Courtney for this one.

HipChat suffered another outage last week when they tried to push a remediation from a previous outage. Again with admirable speed, they’ve posted a detailed postmortem, including excellent lessons that we can all learn from.

This deployment was an important remediation from the previous outages and seemed like the right thing to do.
Lesson learned: No matter how much you want to remediate a problem for your users, consider the risk, complexity, and timing, and then reconsider again.

I love human error. Or rather, I love when an incident is reported as “human error”, because the story is inevitably more nuanced than that. Severe incidents are always the result of multiple things going wrong simultaneously. In this case, it was an operator mistake, insufficient radios and badges for responders, and lack of an established procedure for alerting utility customers.

A detailed exploration of latency and how it can impact online services, especially games.

Online gaming customers are twice as likely to abandon a game when they experience a network delay of 50 additional milliseconds

Say “eliminate downtime” and I’ll be instantly skeptical, but this article is a nice overview of predictive maintenance systems in datacenters.

Data centers use complex hardware that presents unforeseen problems that calendar-based maintenance checks simply cannot anticipate.


Updated: April 3, 2016 — 3:02 pm
A production of Tinker Tinker Tinker, LLC