SRE Weekly Issue #122

SPONSOR MESSAGE

Determining the right tools for your SRE team(s) can get confusing. So, VictorOps, InfluxData, and Grafana are putting on a webinar—May 16th, 1 pm ET—to help you build your SRE toolchain: http://try.victorops.com/SREWeekly/Webinar

Articles

After adopting a “full ownership” philosophy, this company faced burnout, with five or more separate developers on call simultaneously. Read about their awesome solution involving a shared on-call rotation staffed entirely by volunteers, spurred by the incentive of extra compensation.

Brian Scanlan — Intercom

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

Liz Fong-Jones and Seth Vargo — Google

After a load test uncovered a scaling issue, they dug deep, finding issues with garbage collection settings, cascading failures, and an overeager retry strategy.

Val Markovic — LinkedIn
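
For readers unfamiliar with why an overeager retry strategy feeds cascading failures: clients that retry immediately and in lockstep multiply load on a service that is already struggling. One common mitigation (not necessarily the one the article describes) is capped exponential backoff with jitter. The sketch below is a minimal illustration; the function name, parameters, and defaults are all hypothetical.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `operation` with capped exponential backoff and full jitter.

    `sleep` and `rng` are injectable so the behavior can be tested
    without real waiting or real randomness.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the ceiling so that
            # many clients do not retry at the same instant.
            sleep(rng() * ceiling)
```

Bounding the number of attempts matters as much as the delay: an unbounded retry loop keeps adding load for as long as the outage lasts.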

These tips cover the basics and will be especially useful for teams onboarding engineers that have never been on-call before.

This article examines a case study of an EMS company attempting to adopt a just culture policy. There’s a great discussion of why it’s not a good idea to lay blame on individuals when systemic problems may be far more important.

Larry Boxman and Paul LeSage — JEMS (Journal of Emergency Medical Services)

In this third and final article in a series, Xero lays out their process for analyzing incidents after the fact. Thanks to the Xero folks for being so open about your processes and for taking the time to write these articles!

Karthik Nilakant — Xero

I like the nifty heat maps with example distributed traces. Neat idea!

JBD — Google

Outages

SRE Weekly Issue #121

Articles

This latest in the CRE Life Lessons series takes on dependencies and how they impact a service’s SLO in obvious and subtle ways.

Robert van Gent — Google
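
As a rough illustration of the obvious half of that impact: if every request needs all of a service's dependencies, and failures are assumed to be independent, the service's best-case availability is the product of its own availability and theirs. The function name and the figures below are hypothetical and are not taken from the article.

```python
from math import prod


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Upper bound on availability when every dependency is required for
    each request and failures are assumed to be independent."""
    return own * prod(dependencies)


# Hypothetical figures: a 99.95% service with three 99.9% hard dependencies.
combined = serial_availability(0.9995, [0.999, 0.999, 0.999])
print(f"{combined:.4%}")
```

With these made-up numbers the combined figure lands below 99.7%, lower than any single component, which is why a service cannot promise a tighter SLO than its hard dependencies allow.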

This company discovered that the benefits of microservices came with some significant downsides. Here’s how they turned to chaos testing to improve reliability.

Meredith Courtemanche — TechTarget

Keeping in mind that this is written by the CTO of Gremlin, it contains some good points about buying versus building your chaos engineering system. It would apply to other chaos engineering services too — if there were any.

Matt Fornaciari — Gremlin, Inc.

Even as an experienced Terraform user, I learned about some Terraform features I hadn’t been aware of.

Nic Jackson — HashiCorp

In issue #98, I linked to a recording of John Allspaw’s DOES17 talk. In case you didn’t have time to listen, here’s a transcript. If you didn’t have time to read the Stella Report, I highly recommend reading this as an intro to the major concepts therein.

John Allspaw

Outages

SRE Weekly Issue #120

SPONSOR MESSAGE

A combination of the right people and the right tools creates SRE-friendly environments. See the hundreds of tools and integrations that already work with VictorOps to make your people better and help you maintain more reliable systems: http://try.victorops.com/SREWeekly/Tools

Articles

“You can OOM a single NUMA node” thus entered my list of things to worry about when a box seems to have plenty of memory but still goes off and slaughters innocent (but big) processes.

Rachel Kroll

In this podcast episode, the panelists hold a retrospective for the snow-related delay of DevOps Days Baltimore. Toward the end they go into the idea of reliability and single points of failure with respect to conference planning. My favorite quote in the show, from Nell Shamrell-Harrington:

Incident Management is never about technology — it’s a people.

Nell Shamrell-Harrington and Nathen Harvey

I really love this Who, Me? section from The Register.

Simon Sharwood — The Register

This article has a great discussion of how to get started with chaos engineering — and how to avoid biting off more than you can chew.

Jennifer Riggins — The New Stack

Beamer is a stateless datacenter load balancer supporting both TCP and Multipath TCP (MPTCP). It manages to keep the load balancers stateless by taking advantage of connection state already held by servers.

Super-clever! The LB does keep state, but the size of the state is constant, unrelated to the number of connections flowing through it.

Adrian Colyer — summary, Olteanu et al. — original paper

Sometimes it’s worthwhile to lay everything out and describe just exactly what we’re up against as SREs. The analogies here are pretty awesome. Read this for a hefty dose of cynicism about the state of our increasingly computer-driven world.

Peter Welch

Outages

SRE Weekly Issue #119

Articles

If you missed the STELLA Report, released last fall during Velocity NYC by John Allspaw, Richard Cook, and David Woods, this podcast is a great intro. And even if you did catch it, it’s well worth a listen. The Food Fight folks interview John Allspaw and there’s some really stellar (see what I did there) back-and-forth discussion.

Alan Kraft and Nathen Harvey

Great idea. This reminds me of a couple jobs back where I rigged up our infrastructure to log every command entered at the shell into a Slack channel.

Rachel Kroll

This excerpt from the Google SRE book is worth reading if only for this nifty idea for graceful degradation:

Other techniques include […] choosing a consistent subset of clients to receive errors, preserving a good user experience for the remainder.
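
One way to pick a "consistent subset" is to hash a stable client identifier into a bucket and shed only the clients whose bucket falls below a threshold. This is a minimal sketch under that assumption, not the book's implementation; `should_shed`, the bucket count, and the client IDs are hypothetical.

```python
import hashlib


def should_shed(client_id: str, shed_fraction: float) -> bool:
    """Return True if this client is in the subset chosen to receive errors.

    Hashing the client ID maps each client to a stable bucket in
    [0, 10000), so the same clients are shed on every request for as
    long as the overload lasts.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < shed_fraction * 10_000
```

Because each client's bucket never changes, raising `shed_fraction` only adds clients to the shed set and lowering it only removes them; everyone else keeps a consistently good experience instead of all clients seeing intermittent errors.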

In part two of this story, the author causes their first incident (oops) and subsequently significantly improves the performance of the system in question (cool!).

Evan Smith — Hosted Graphite

An introduction to blue/green deployments, including the risks they help alleviate.

Mark Henke — Rollout.io

instead of giving guidelines on how and when to do things, I am going to lay out a few ideas on how to respond to alerts and leave it up to you to decide what methods work best for your app and your organization.

Peter Christian Fraedrich — Capital One

Especially in Ubuntu, it’s harder than it used to be to get a core dump, thanks to apport and the like.

Julia Evans

NCDEX, a stock exchange in Mumbai, India, has been operating out of its disaster recovery site for two weeks. Unfortunately, it looks like performance is not on par with the standard site.

Rajesh Bhayani — Business Standard

You may have heard that a Southwest flight suffered a catastrophic engine failure that left one passenger dead. The day after, my family flew a Southwest flight to Orlando. Yikes.

The air traffic control audio recording is incredible to listen to. The pilot who was on the radio was cool and calm as she responded to the incident and arranged for landing and emergency ground crews.

Outages

SRE Weekly Issue #118

Sorry, a little late this week as my family and I head off to Disney World! No issue next Sunday, and I’ll see you all on April 29.

SPONSOR MESSAGE

SRE isn’t just a dedicated role. SRE is a behavior and culture purpose-built to improve collaboration and promote accountability. In the following article, Dan Hopkins, VP of Engineering at VictorOps, takes you on a journey to creating a positive internal perception of SRE within your organization: http://try.victorops.com/SREWeekly/sre-is-a-behavior

Articles

I have different thoughts than the author on a few of the points, but it’s very useful and enlightening to see their thought process.

Will Gallego

What it says on the tin. Pretty neat CI setup!

Bridget Lane — USA Today

Full disclosure: Fastly, my employer, is mentioned.

“Why-run” mode is Chef’s “do nothing” or “dry run” mode. As it turns out, it may not be so useful when trying to figure out what Chef will do.

Julian Dunn — Chef

Lots of deep thoughts on what makes on-call hard and what we can do about it.

Cody Wilbourn

One little typo is all it took.

Rachel Kroll

Q&A about a task queuing system that freezes up if the queue is kept full at all times.

A new hire tells us what it’s like to get up to speed as an SRE at Hosted Graphite.

Evan Smith — Hosted Graphite

Outages

  • Discord
  • Mauritania
    • Another one of those “oh look an entire country lost its Internet, this is the first time that’s ever happened!!1” articles.
  • Twitter
A production of Tinker Tinker Tinker, LLC