Articles
SREcon20 Asia/Pacific has been rescheduled to September 7–9, 2020.
This article has a definite marketing slant. It’s nonetheless interesting to see how Slack is handling the situation.
Cal Henderson and Robby Kwok, Slack
I love this gem:
I’m not surprised companies that are far into their observability journey start advocating for testing in production – once you have the data and you can slice & dice it as you see fit, testing in production seems like a totally reasonable thing to do.
Mads Hartmann
With many companies suddenly shifting into figuring out how to become distributed organizations overnight, we can learn many lessons by looking at incident response patterns.
George Miranda — PagerDuty
Today’s post is a double header. I’ve chosen two papers from NSDI’20 that are both about correlation.
Paper #1 describes a tool that identifies files that are often changed together, and warns you if you change A but forget B.
Paper #2 describes a tool for finding correlated failure risks that threaten reliability.
Mehta et al. — NSDI’20 (original paper #1)
Zhai et al. — NSDI’20 (original paper #2)
Adrian Colyer — The Morning Paper (summaries)
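To make the idea behind paper #1 concrete, here is a minimal sketch of co-change mining over commit history. It is not the paper's method, just plain association-rule counting, and every name in it (`mine_cochange_rules`, `missing_files`, the thresholds) is a hypothetical chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_cochange_rules(commits, min_support=2, min_confidence=0.8):
    """Return {(a, b): confidence}: when file a changes, b usually changes too.

    `commits` is an iterable of file-name lists, one list per commit.
    The thresholds are illustrative assumptions, not tuned values.
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = set(files)
        file_counts.update(unique)
        # Count each co-occurring pair in both directions.
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    rules = {}
    for (a, b), together in pair_counts.items():
        confidence = together / file_counts[a]
        if together >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def missing_files(changed, rules):
    """List files that usually change with `changed` but are absent from it."""
    changed = set(changed)
    return sorted({b for (a, b) in rules if a in changed and b not in changed})


history = [["A", "B"], ["A", "B"], ["A", "B", "C"], ["C"]]
rules = mine_cochange_rules(history)
warnings = missing_files(["A"], rules)
```

With this toy history, a change that touches only A would be flagged as probably missing B, because the two files changed together in every commit that touched either of them.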
The components from the article are:
Ability to recognize how bad the situation really is, and prioritize it
Effective communication skills
Compassionate responses to mistakes and a learning mindset
Hannah Culver — Blameless
We’re pleased to announce Failover Conf, a conference focused on building resilient systems. The conference will be held online on April 21 and session submissions will be accepted through March 23.
CFP open through March 23.
Gremlin
There are some good tips in here, especially if you’re new to this.
Mandy Mak
Fastly’s Auto Peer Slasher (APS) tool detects when a link is nearing saturation and automatically reroutes traffic through a different interface.
Ryan Landry — Fastly
Full disclosure: Fastly is my employer.
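For readers who want a feel for the general pattern of detecting a nearly saturated link and shifting traffic elsewhere, here is a toy sketch. It is not Fastly's implementation and makes no claim about how APS works internally. The `Link` type, the 90% threshold, and the "move the excess to the link with the most headroom" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    capacity_gbps: float
    load_gbps: float


def rebalance(links, threshold=0.9):
    """Shift load above `threshold` utilization to the link with most headroom.

    Returns a list of (from_link, to_link, gbps_moved) tuples.
    Hypothetical policy: real systems move prefixes or flows, not raw Gbps.
    """
    moves = []
    for hot in links:
        excess = hot.load_gbps - threshold * hot.capacity_gbps
        if excess <= 0:
            continue  # This link is below the saturation threshold.
        candidates = [link for link in links if link is not hot]
        if not candidates:
            continue
        # Pick the alternative with the most room left under the threshold.
        target = max(
            candidates,
            key=lambda link: threshold * link.capacity_gbps - link.load_gbps,
        )
        headroom = threshold * target.capacity_gbps - target.load_gbps
        shift = min(excess, headroom)
        if shift <= 0:
            continue  # Nowhere to put the traffic without overloading others.
        hot.load_gbps -= shift
        target.load_gbps += shift
        moves.append((hot.name, target.name, shift))
    return moves


links = [Link("peer-a", 10.0, 9.5), Link("transit-b", 10.0, 4.0)]
moves = rebalance(links)
```

Capping each shift at the target's headroom keeps the sketch from fixing one hot link by creating another.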