SRE Weekly Issue #2

View on sreweekly.com

I’m still working out all of the kinks for SRE Weekly, so the issue for this “week” is hot on the heels of the last one as I clear out my backlog of articles. Coming soonish: decent CSS.

Articles

VictorOps’s 2015 State of On-Call Report

Managing the burden of on-call is critical to any organization’s incident response. Tired incident responders make mistakes, miss pages, and don’t perform as effectively. In SRE, we can’t afford to ignore this stuff. Thanks to VictorOps for doing the legwork on this!

LinkedIn’s Active/Active Evolution

A talk at QCon from LinkedIn about how they spread out to multiple datacenters.

Disaster recovery planning: Where virtualisation can help

A review of designing a disaster recovery solution, and where virtualization fits in the picture.

Day 5 – ELK Operations and Administration

Not strictly directly related to reliability (unless you’re providing ELK as a service, of course), but I’ve found ELK to be very valuable in detecting and investigating incidents. Scaling ELK well can be an art, and in this article, Etsy describes how they set theirs up.

DRaaS: How Can Providers Make Life Easier And Less Expensive For Companies?

This series of articles is actually the first time I’d seen mention of DRaaS. I’m not sure I’m convinced that it makes sense to hire an outside firm to handle your DR, but it’s an interesting concept.