SRE Weekly Issue #211

A message from our sponsor, VictorOps:

What’s the most important metric for SRE? Well, Splunk Cloud Platform SRE, Jonathan Schwietert, argues that it’s customer happiness. On Tuesday (03/17), join Splunk + VictorOps for a webinar about using SRE and observability to build customer-first applications and services.


SRECon20 Asia/Pacific is rescheduled to September 7–9, 2020.

This article has a definite marketing slant. It’s nonetheless interesting to see how Slack is handling the situation.

Cal Henderson and Robby Kwok, Slack

I love this gem:

I’m not surprised companies that are far into their observability journey start advocating for testing in production – once you have the data and you can slice & dice it as you see fit, testing in production seems like a totally reasonable thing to do.

Mads Hartmann

With many companies suddenly shifting into figuring out how to become distributed organizations overnight, we can learn many lessons by looking at incident response patterns.

George Miranda — PagerDuty

Today’s post is a double header. I’ve chosen two papers from NSDI’20 that are both about correlation.

Paper #1 is a tool that helps identify when files A and B are often changed at the same time, and warns you if you forgot B.

Paper #2 is a tool for finding correlated failure risks that threaten reliability.

Mehta et al. — NSDI’20 (original paper #1)
Zhai et al. — NSDI’20 (original paper #2)
Adrian Colyer — The Morning Paper (summaries)
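
To make the idea behind Paper #1 concrete, here is a minimal sketch of co-change detection. It is not the paper's algorithm, just simple co-occurrence counting over commit history with a confidence threshold. The function names, thresholds, and sample file names are all hypothetical.

```python
from collections import Counter
from itertools import permutations


def mine_rules(commits, min_support=2, min_confidence=0.8):
    """Return {(a, b): confidence} for rules "a changed implies b changed".

    commits is a list of file-name lists, one per commit. The thresholds
    are illustrative assumptions, not values taken from the paper.
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = set(files)
        file_counts.update(unique)
        # Count every ordered pair so rules can point in either direction.
        pair_counts.update(permutations(sorted(unique), 2))

    rules = {}
    for (a, b), together in pair_counts.items():
        confidence = together / file_counts[a]
        if together >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def missing_files(rules, changed):
    """List files that usually change alongside `changed` but are absent."""
    changed = set(changed)
    return sorted({b for (a, b) in rules if a in changed and b not in changed})


# Hypothetical commit history: the code and its config change together.
history = [
    ["service.py", "service.yaml"],
    ["service.py", "service.yaml"],
    ["service.py", "service.yaml", "README"],
    ["README"],
]
rules = mine_rules(history)
print(missing_files(rules, ["service.py"]))
```

A change that touches only service.py would be flagged for leaving out service.yaml, while a README-only change passes quietly because README has no strong partner in this history.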

The components from the article are:

Ability to recognize how bad the situation really is, and prioritize it
Effective communication skills
Compassionate responses to mistakes and a learning mindset

Hannah Culver — Blameless

We’re pleased to announce Failover Conf, a conference focused on building resilient systems. The conference will be held online on April 21 and session submissions will be accepted through March 23.

CFP open through March 23.


There are some good tips in here, especially if you’re new to this.

Mandy Mak

Fastly’s APS tool (Auto Peer Slasher) detects when a link is nearing saturation and automatically reroutes traffic through a different interface.

Ryan Landry — Fastly

Full disclosure: Fastly is my employer.
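
For readers who want a feel for the general technique, here is a rough sketch of threshold-based traffic shifting. It is not Fastly's implementation. The Link class, the 90% threshold, and the "move the excess to the link with the most headroom" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    capacity_mbps: float
    traffic_mbps: float

    @property
    def utilization(self):
        return self.traffic_mbps / self.capacity_mbps


def plan_reroute(links, threshold=0.9):
    """Shift excess traffic off links above `threshold` utilization.

    Returns a list of (source, destination, mbps) moves and updates the
    links in place. The threshold and selection policy are hypothetical.
    """
    def headroom(link):
        # Spare capacity before this link itself reaches the threshold.
        return threshold * link.capacity_mbps - link.traffic_mbps

    moves = []
    for link in links:
        if link.utilization <= threshold:
            continue
        others = [other for other in links if other is not link]
        if not others:
            continue
        excess = link.traffic_mbps - threshold * link.capacity_mbps
        target = max(others, key=headroom)
        # Never push the target past the threshold either.
        shift = min(excess, max(headroom(target), 0.0))
        if shift > 0:
            link.traffic_mbps -= shift
            target.traffic_mbps += shift
            moves.append((link.name, target.name, shift))
    return moves


# Hypothetical links: one peering link near saturation, one with room.
links = [
    Link("peer-a", capacity_mbps=1000, traffic_mbps=950),
    Link("transit-b", capacity_mbps=1000, traffic_mbps=400),
]
moves = plan_reroute(links)
print(moves)
```

A real system would also need to smooth its measurements, avoid flapping, and move traffic back once the congestion clears. None of that is modeled here.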


Updated: March 15, 2020 — 11:01 pm