SRE Weekly Issue #475

I haven’t seen this level of detail in an article on anomaly detection in quite a while. Still, the math is very approachable even if you slept through stats class.

  Ivan Shubin — Booking.com

TL;DR: The Power of Knowledge Overlap in Incident Response

There’s an anecdote in this one that’s really making me think.

  Hamed Silatani — Uptime Labs

One of the criticisms leveled at resilience engineering is that the insights that the field generates aren’t actionable […]

This article argues that we still need the unactionable but good models; otherwise, we’ll get actionable but wrong models.

  Lorin Hochstein

Datadog has put a lot of thought and effort into managing their massive Kafka workload. My favorite part of this article was the bit about accidentally zip-bombing themselves with highly compressible data.

  Guillaume Bort — Datadog

This one covers four techniques for rerouting customer traffic after a region failure using AWS’s Route 53… themed after the TV show The Good Place. It’s been quite a while since I watched the show, but I still found the article pretty useful.

  Seth Elliot — Arpio

This article asks what we’re really looking to get by defining an incident severity scale, and proposes an alternative scale based on incident complexity.

  Dan Slimmon

I love this idea of tracking configuration changes as observability data. I’ve been through plenty of incidents in which I wished I had it.

  Yevgeny Pats — CloudQuery

A short and sweet article packed with some useful nuggets. My favorite is the section near the end on timeouts.

  Hemant Burman — Insights

Updated: May 4, 2025 — 9:48 pm
A production of Tinker Tinker Tinker, LLC