SRE Weekly Issue #28

A more packed issue this week to make up for missing last week. This issue is coming to you from beautiful Cape Cod, where my family and I are vacationing for the week.

Articles

In April, Google Compute Engine suffered a major outage that was reported here. I wrote up this review for the Operations Incident Board’s Postmortem Report Reviews project.

Migration of a service without downtime can be an incredibly challenging engineering feat. Netflix details their effort to migrate their billing system complete with its rens of terabytes of RDBMS data into EC2.

Our primary goal was to define a secure, resilient and granular path for migration to the Cloud, without impacting the member experience.

Ransomware is designed to really ruin your day. It not only corrupts your in-house data; it also tries to encrypt your backup. Even if it’s off-site. Does your backup/recovery strategy stand up to this kind of failure?

VictorOps gives us this shiny, number-filled PDF that you can use as ammunition to convince execs that downtime really matters.

Students of Holberton School‘s full-stack engineer curriculum are on-call and actually get paged in the middle of the night. Nifty idea. Why should training in on-call only be on-the-job?

I think the rumble strip is a near-perfect safeguard.

That’s Pre-Accident Podcast’s Todd Conklin on rumble strips, the warning tracks on the sides of highways. This short (4-minute) podcast asks the question, can we apply the principles behind rumble strips in our infrastructures?

The FCC adds undersea cable operators to the list of mandatory reporters to the NORS (Network Outage Reporting System). But companies such as AT&T claim that the reporting will be of limited value, since outages that have no end-user impact (due to redundant underseas links) must still be reported.

Microsoft updated its article on designing highly available apps using Azure. These kinds of articles are important. In theory, no one ought to go down just because one EC2 or Azure region goes down.

SignalFX published this four-part series on avoiding spurious alerts in metric-based monitoring systems. The tutorial bits are specific to SignalFX, but the general principles could be applied to any metric-based alerting system.

Thanks to Aneel at SignalFX for this one.

Outages

Updated: June 26, 2016 — 10:43 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme