SRE Weekly Issue #8


If you only read two articles this week, make it these first two. They’re excellent and exactly the kind of content I’m looking for. If you come across (or write!) anything that would go well in SRE Weekly, I’d love it if you’d toss a link my way.

Articles

Liz Fong-Jones, a Googler and co-chair of SRECon, describes a scale of activities SRE teams engage in, from the basics (keeping the service operating) to having the freedom to improve the service.

This is a really awesome paper. Two Googlers describe in detail the pitfalls of failover-based systems and explain how they design multi-homed active/active services. If Google has learned a lesson, we’d all do well to learn from it, too:

Our experience has been that bolting failover onto previously singly-homed systems has not worked well. These systems end up being complex to build, have high maintenance overhead to run, and expose complexity to users. Instead, we started building systems with multi-homing designed in from the start, and found that to be a much better solution. Multi-homed systems run with better availability and lower cost, and result in a much simpler system overall.

A review of CloudHarmony’s numbers on various cloud providers’ availability in 2015 versus 2014, along with a discussion of how customers deal with outages. I’m a little puzzled by this one:

That’s also partly why most public cloud workloads aren’t used for production or mission-critical applications.

I’m pretty sure plenty of mission-critical stuff is running in EC2, for example.

The team at parall.ax chose Lambda because there are no long-lived servers, and they could offload all the work of scaling their app up and down with demand to Amazon.

Randall Munroe takes on an important question: is it possible to siphon water from Europa to Earth? Okay, the only relation to SRE is that a team of Google SREs submitted the question, but I really love What If.

VictorOps distilled their Minimum Viable Runbooks series (featured here previously) into a polished PDF in their usual high quality and style.

During an outage this week, Vodafone admitted that they forgot to update their status site. They are looking into an automated system to make updates during outages.

Most of the jobs I’ve worked offered no compensation for on-call, but one did. The compensation was nice, but it was there to offset a truly heinous level of pages, so it was small comfort. If you have any good articles about the merits and pitfalls of on-call compensation, please send them my way.

Outages

Lots of downtime this week, including some recurrences and some big names.

Updated: January 31, 2016 — 10:17 pm
A production of Tinker Tinker Tinker, LLC