SRE Weekly Issue #211

Articles

SREcon20 Asia/Pacific

SREcon20 Asia/Pacific has been rescheduled to September 7–9, 2020.

Business continuity at Slack: Keeping our customers up and running during COVID-19

This article has a definite marketing slant. It’s nonetheless interesting to see how Slack is handling the situation.

Cal Henderson and Robby Kwok, Slack

Journey into Observability: Glitch’s journey

I love this gem:

I’m not surprised companies that are far into their observability journey start advocating for testing in production – once you have the data and you can slice & dice it as you see fit, testing in production seems like a totally reasonable thing to do.

Mads Hartmann

Lessons in Distributed Communication From Incident Response

With many companies suddenly shifting into figuring out how to become distributed organizations overnight, we can learn many lessons by looking at incident response patterns.

George Miranda — PagerDuty

When correlation (or lack of it) can be causation

Today’s post is a double header. I’ve chosen two papers from NSDI’20 that are both about correlation.

Paper #1 is a tool that helps identify when files A and B are often changed at the same time, and warns you if you forgot B.

Paper #2 is a tool for finding correlated failure risks that threaten reliability.

Mehta et al. — NSDI’20 (original paper #1)
Zhai et al. — NSDI’20 (original paper #2)
Adrian Colyer — The Morning Paper (summaries)
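
To make the idea behind paper #1 concrete, here is a minimal sketch of co-change detection. It is not the method from the paper, just a simple co-occurrence counter. The function names, thresholds, and file names are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def co_change_rules(commits, min_support=2, min_confidence=0.6):
    """Return {(a, b): confidence}: when file a changes, b usually changes too.

    `commits` is a list of file-name lists, one per historical change.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = set(files)
        file_counts.update(unique)
        # Count every ordered pair of files that changed in the same commit.
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    rules = {}
    for (a, b), together in pair_counts.items():
        if together < min_support:
            continue
        confidence = together / file_counts[a]
        if confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def missing_files(changed, rules):
    """List files that usually accompany `changed` but are absent from it."""
    changed = set(changed)
    return sorted({b for (a, b) in rules if a in changed and b not in changed})


# Hypothetical change history.
history = [
    ["service.yaml", "service_test.yaml"],
    ["service.yaml", "service_test.yaml", "README.md"],
    ["service.yaml", "service_test.yaml"],
    ["README.md"],
]
rules = co_change_rules(history)
print(missing_files(["service.yaml"], rules))
```

With this history, a change that touches only service.yaml is flagged as probably missing service_test.yaml, while a README-only change raises no warning.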

Great Incident Response Requires 3 Major Components

The components from the article are:

Ability to recognize how bad the situation really is, and prioritize it
Effective communication skills
Compassionate responses to mistakes and a learning mindset

Hannah Culver — Blameless

Announcing Failover Conf

We’re pleased to announce Failover Conf, a conference focused on building resilient systems. The conference will be held online on April 21 and session submissions will be accepted through March 23.

CFP open through March 23.

Gremlin

Grow your blame-free culture with these postmortem best practices

There are some good tips in here, especially if you’re new to this.

Mandy Mak, FireHydrant

How network automation helps Fastly support the world’s biggest live-streaming moments

Fastly’s APS tool (Auto Peer Slasher) detects when a link is nearing saturation and automatically reroutes traffic through a different interface.

Ryan Landry — Fastly

Full disclosure: Fastly is my employer.
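
As a rough illustration of the detect-and-reroute idea described for APS, here is a minimal sketch. It is not Fastly's implementation. The `Link` type, the `plan_reroutes` function, the 75% threshold, and the link names are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    capacity_gbps: float
    traffic_gbps: float


def plan_reroutes(links, threshold=0.75):
    """Return (source, target, gbps) moves that bring hot links back to threshold.

    A link is treated as "nearing saturation" once its traffic exceeds
    threshold * capacity. The threshold is an illustrative assumption.
    """
    # Headroom each link has before it would itself cross the threshold.
    spare = {
        link.name: max(0.0, threshold * link.capacity_gbps - link.traffic_gbps)
        for link in links
    }
    moves = []
    for link in links:
        excess = link.traffic_gbps - threshold * link.capacity_gbps
        if excess <= 0:
            continue
        # Shift excess traffic to the links with the most headroom first.
        for target in sorted(spare, key=spare.get, reverse=True):
            if excess <= 0:
                break
            if target == link.name or spare[target] <= 0:
                continue
            shift = min(excess, spare[target])
            moves.append((link.name, target, shift))
            spare[target] -= shift
            excess -= shift
    return moves


# Hypothetical links: one peering link running hot, one transit link with room.
links = [
    Link("peer-a", capacity_gbps=100, traffic_gbps=90),
    Link("transit-b", capacity_gbps=100, traffic_gbps=40),
]
print(plan_reroutes(links))
```

Here the planner proposes moving 15 Gbps from peer-a to transit-b, which brings peer-a back to the threshold without pushing transit-b over it. A real system would also need to account for routing policy, cost, and how quickly traffic levels change.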
