SRE Weekly Issue #28

A more packed issue this week to make up for missing last week. This issue is coming to you from beautiful Cape Cod, where my family and I are vacationing for the week.

Articles

Postmortem-Report-Reviews/2016-04-14-lexelby-Google-Compute-Engine-2016-04-11.md

In April, Google Compute Engine suffered a major outage that was reported here. I wrote up this review for the Operations Incident Board’s Postmortem Report Reviews project.

The Netflix Tech Blog: Netflix Billing Migration to AWS

Migrating a service without downtime can be an incredibly challenging engineering feat. Netflix details their effort to migrate their billing system, complete with its tens of terabytes of RDBMS data, into EC2.

Our primary goal was to define a secure, resilient and granular path for migration to the Cloud, without impacting the member experience.

How Ransomware Changes Backup and Disaster Recovery

Ransomware is designed to really ruin your day. It not only corrupts your in-house data; it also tries to encrypt your backup. Even if it’s off-site. Does your backup/recovery strategy stand up to this kind of failure?

Making the Case for Real-Time Incident Management: Downtime Data and DevOps

VictorOps gives us this shiny, number-filled PDF that you can use as ammunition to convince execs that downtime really matters.

DevOps Students Learn the Value of Uptime With 3 a.m. Calls

Students of Holberton School’s full-stack engineer curriculum are on-call and actually get paged in the middle of the night. Nifty idea. Why should training in on-call only be on-the-job?

Safety Moment – Rumble Strips

I think the rumble strip is a near-perfect safeguard.

That’s Pre-Accident Podcast’s Todd Conklin on rumble strips, the warning tracks on the sides of highways. This short (4-minute) podcast asks the question: can we apply the principles behind rumble strips in our infrastructures?

FCC Adopts Rules to Promote Reliable Submarine Cable Communications Infrastructure

The FCC adds undersea cable operators to the list of mandatory reporters to the NORS (Network Outage Reporting System). But companies such as AT&T claim that the reporting will be of limited value, since outages that have no end-user impact (due to redundant undersea links) must still be reported.

Updated high availability and disaster recovery app design guidance

Microsoft updated its article on designing highly available apps using Azure. These kinds of articles are important. In theory, no one ought to go down just because one EC2 or Azure region goes down.

Reducing Alert Noise

SignalFX published this four-part series on avoiding spurious alerts in metric-based monitoring systems. The tutorial bits are specific to SignalFX, but the general principles could be applied to any metric-based alerting system.

Thanks to Aneel at SignalFX for this one.

Outages

Baltimore, MD, USA 911 (emergency services)
- Verizon blames a routing error.
HBO NOW
- HBO NOW’s stream of Game of Thrones sputtered and died just as the most anticipated episode of the season spooled up.
Bitfinex (Bitcoin exchange)
- The outage purportedly resulted in a Bitcoin price dip.
Telia (transit provider)
- This one was big. The Register reports in this article that Telia mistakenly routed Europe’s traffic to Hong Kong. Many services and providers were impacted, including CloudFlare, Slack, and AWS’s eu-west-1 region.
Blizzard
LinkedIn
- Just two days after the Microsoft acquisition.
Verizon (Florida, USA)
YouTube
Spotify
PlayStation Network
Asos.com
Twitter (India)
US Air Force Inspector General’s database
- Thanks to Niall for this one.
