SRE Weekly Issue #38

Welcome to the many new subscribers who joined in the past week. I’m not sure who I have to thank for the sudden surge, but whoever you are, thanks!

Articles

The Fire Service and the Aviation Industry – Firefighter Safety – Crew Resource Management

What can the fire service learn about safety from the aviation industry? A 29-year veteran in the fire service answers that question in detail. We could in turn apply all of those lessons to operating complex infrastructures.

Wikipedia: High Reliability Organization

I’m surprised that I haven’t come across the term “High Reliability Organization” before reading the previous article. Here’s Wikipedia’s article on HROs.

A high reliability organization (HRO) is an organization that has succeeded in avoiding catastrophes in an environment where normal accidents can be expected due to risk factors and complexity.

Tracking Every Release

Etsy instruments their deployment system to strike a vertical line on their graphite graphs for every deploy. This helps them quickly figure out whether a deploy happened right before a key metric took a turn for the worse.
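
To make the idea concrete, here is a minimal sketch of how a deploy script might record a marker in Graphite. It is not Etsy's actual tooling: the metric name `deploys.web.production`, the host argument, and both helper functions are assumptions made up for illustration. The only things it relies on are Graphite's plaintext protocol (one `path value timestamp` line per data point, conventionally accepted by Carbon on port 2003) and Python's standard library.

```python
import socket
import time


def format_deploy_event(metric_path, timestamp=None):
    # Build one data point in Graphite's plaintext protocol:
    # "<metric path> <value> <unix timestamp>" terminated by a newline.
    # The value 1 simply means "a deploy happened at this moment".
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} 1 {int(timestamp)}\n"


def send_deploy_event(host, port=2003, metric_path="deploys.web.production"):
    # Hypothetical helper: open a TCP connection to a Carbon listener
    # and send a single deploy marker. Host and metric name are placeholders.
    line = format_deploy_event(metric_path)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line
```

A deploy script could call `send_deploy_event("graphite.example.internal")` as its last step. On the graphing side, wrapping the series in Graphite's `drawAsInfinite()` function, for example `drawAsInfinite(deploys.web.production)`, renders each nonzero point as a full-height vertical line that can be overlaid on any other metric.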

Undersea cables keep global enterprise networks afloat

A really interesting dive into the world of subsea network cables and the impact that cuts can have on businesses worldwide.

Testing in production comes out of the shadows

How closely can you really mimic production in your testing environments? In a way we’re all testing in production, and this article talks about getting that fact out in the open.

Two Suggestions to Help SL Scale

I wrote this article on my terrible little blog back in 2008 — forgive the horrid theme and apparently broken unicode support. This was well before I worked in Linden Lab’s Ops team, back when I was making a living as a user selling content in Second Life. What’s interesting to me in reading this article 8 years later is the user perspective on the impact of the string of bad outages, and especially Linden’s poor communication during outages.

Delta says it lost $100 million in revenue due to big outage

More on the impact of Delta Air Lines’ major outage last month.

We’re learning the wrong lessons from airline IT outages

Most often a catastrophic failure is not due to a lack of standards, but a breakdown or circumvention of established procedures that compounded into a disastrous outcome. Multilayer complex systems outages signify management failure to drive change and improvement.

Outages

Salesforce
- Salesforce.com was down or impaired for several hours.
  
  Full disclosure: Salesforce is the parent company of my employer, Heroku.
DynamoDB
Telstra Mail
Google Cloud Platform
- Normally I don’t include single-zone failures in EC2 or GCP, but this one has an extremely interesting and detailed postmortem.
EA (FIFA 16 and Battlefield 1 Beta)
Vodafone (Ireland)
Interpublic Group (Hollywood PR agency)
Vesk
- The Register noted that Vesk bragged about their 100% uptime even after the outage, including for all of 2016. From Vesk’s recently changed about page:
  
  “VESK has hit 100% uptime for all 2012, 2013, 2014, 2015 and 2016.”
PlayStation Network
PagerDuty
- PagerDuty is currently unable to process some inbound events. We are investigating the cause.
Telkom (South Africa telecom)
- The company cited suspected sabotage and offered a monetary reward.
Washington, DC 911 system
- Emergency services were knocked out for 90 minutes after a contract worker mistakenly hit the emergency shut-off button. The phrase “human error” is being tossed about.
