SRE Weekly Issue #18

SRECon16 was awesome! Sorry for the light issue this week — still recovering from my con-hangover. I had an incredible time, and I enjoyed meeting many of you, both old subscribers and new. Thank you all for your support! When USENIX posts their recordings, I’ll share links to some of my favorite talks.

QotW, from Charity Majors’s day 1 closing keynote (paraphrased):

There are no bad decisions. We make the best decisions we can with the information we have at the time.

Love it. The second QotW was from Rachel Kroll’s day 1 opening keynote, which included a hilarious and cringe-worthy story of investigating a very well-hidden bug with an incredibly bizarre set of symptoms. I can’t recommend watching the keynotes (and, well, every talk) enough.

More content next week, after I’ve caught up on my RSS feeds. Thanks again for the huge amount of support you all have shown me — all 250+ of you (and that’s just email subscribers)!

Articles

Telstra exec Kate McKenzie detailed some findings from internal investigations into the recent spate of Telstra incidents. There’s some nice detail here, including possible remediation items and an implication that Telstra is using a blameless retrospective process.

This is a short but excellent template for incident retrospectives in the form of a series of questions. A great place to start if you’re looking to improve your retrospective process.

Etsy’s morgue, a tool for tracking information related to postmortem investigations.

A rockin’ postmortem detailing the failure and recovery of a 1.7 PB filesystem, featuring the creation of a 3 TB ramdisk(!) to speed up the operation.

Thanks to phill-atlassian on hangops #incident_response for this one.

Outages

Updated: April 10, 2016 — 7:58 pm
A production of Tinker Tinker Tinker, LLC