SRE Weekly Issue #38

Welcome to the many new subscribers who joined in the past week. I’m not sure who I have to thank for the sudden surge, but whoever you are, thanks!


Got ChatOps? Download the free eBook from O’Reilly Media and VictorOps:

Want even more? Meet the author on Sept 8th in a live stream event:


What can the fire service learn about safety from the aviation industry? A 29-year veteran in the fire service answers that question in detail. We could in turn apply all of those lessons to operating complex infrastructures.

I’m surprised that I hadn’t come across the term “High Reliability Organization” before reading the previous article. Here’s Wikipedia’s article on HROs.

A high reliability organization (HRO) is an organization that has succeeded in avoiding catastrophes in an environment where normal accidents can be expected due to risk factors and complexity.

Etsy instruments their deployment system to draw a vertical line on their Graphite graphs for every deploy. This helps them quickly tell whether a deploy happened right before a key metric took a turn for the worse.
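For anyone who wants to try something similar, here is a minimal sketch of one common way to do it. It is not necessarily how Etsy's tooling works. It assumes a Carbon daemon listening on its default plaintext port (2003), and the host `graphite.example.com`, the `deploys.<app>` metric naming, and both function names are hypothetical choices made for illustration.

```python
import socket
import time


def format_deploy_metric(app, timestamp=None, prefix="deploys"):
    """Build one line of Carbon's plaintext protocol: '<path> <value> <timestamp>\\n'.

    The 'deploys.<app>' naming scheme is an assumption, not a Graphite convention.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return f"{prefix}.{app} 1 {int(timestamp)}\n"


def record_deploy(app, host="graphite.example.com", port=2003, timeout=2.0):
    """Send a single deploy marker to Carbon (host and port are placeholders)."""
    line = format_deploy_metric(app)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
    return line
```

A deploy script would call `record_deploy("web")` as its last step. On the graphing side, adding a target such as `drawAsInfinite(deploys.web)` next to the metric you care about should render each non-zero datapoint as a vertical line, so a deploy that lands just before a spike is easy to spot.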

A really interesting dive into the world of subsea network cables and the impact that cuts can have on businesses worldwide.

How closely can you really mimic production in your testing environments? In a way we’re all testing in production, and this article talks about getting that fact out in the open.

I wrote this article on my terrible little blog back in 2008, so forgive the horrid theme and apparently broken Unicode support. This was well before I worked on Linden Lab’s Ops team, back when I was making a living as a user selling content in Second Life. What’s interesting to me in reading this article eight years later is the user perspective on the impact of the string of bad outages, and especially Linden’s poor communication while they were happening.

More on the impact of Delta Air Lines’ major outage last month.

Most often a catastrophic failure is not due to a lack of standards, but a breakdown or circumvention of established procedures that compounded into a disastrous outcome. Multilayer complex systems outages signify management failure to drive change and improvement.


Updated: September 4, 2016 — 11:38 pm
A production of Tinker Tinker Tinker, LLC