SRE Weekly Issue #475

Anomaly Detection in Time Series Using Statistical Analysis

I haven’t seen this level of detail in an article on anomaly detection in quite a while. Still, the math is very approachable even if you slept through stats class.

Ivan Shubin — Booking.com

A Key Incident Response Skill That Can Reduce Resolution Time

TL;DR: The Power of Knowledge Overlap in Incident Response

There’s an anecdote in this one that’s really making me think.

Hamed Silatani — Uptime Labs

Good models protect us from bad models

One of the criticisms leveled at resilience engineering is that the insights that the field generates aren’t actionable […]

This article argues that we still need the unactionable but good models, otherwise we’ll get actionable but wrong models.

Lorin Hochstein

Achieving relentless Kafka reliability at scale with the Streaming Platform

Datadog has put a lot of thought and effort into managing their massive Kafka workload. My favorite part of this article was the bit about accidentally zip-bombing themselves with highly compressible data.

Guillaume Bort — Datadog

Failover Routing for Disaster Recovery – Ensuring Your Customers Get to The Good Place

This one covers four techniques for rerouting customer traffic after a region failure using AWS’s Route 53… themed after the TV show The Good Place. It’s been quite a while since I watched the show, but I still found the article pretty useful.

Seth Elliot — Arpio

Incident SEV scales are a waste of time

This article asks what we’re really looking to get by defining an incident severity scale, and proposes an alternative scale based on incident complexity.

Dan Slimmon

The Lost Fourth Pillar of Observability – Config Data Monitoring

I love this idea of tracking configuration changes as observability data. I’ve been through plenty of incidents in which I wished I’d had it.

Yevgeny Pats — CloudQuery

Building the future of resilient tech: Lessons from two decades in SRE

A short and sweet article packed with some useful nuggets. My favorite is the section near the end on timeouts.

Hemant Burman — Insights
