SRE Weekly Issue #491

Uptime Labs and the Multi-Party Dilemma (Part I)

This 2-part episode of The VOID Podcast is just awesome, and well worth a listen. The conversation is framed as a retrospective of a simulated incident, and both the incident participants and the retrospective facilitator bring a high level of expertise and experience. I have a lot to think about, especially the discussion of overload and the four ways people react to it.

Courtney Nash — The VOID Podcast, with guests Sarah Butt, Eric Dobbs, Alex Elman, and Hamed Silatani

Tail Sampling: The Future of Intelligent Observability in Distributed Systems

Discover how tail sampling in OpenTelemetry enhances observability, reduces costs, and captures critical traces for faster detection and smarter system monitoring.

Rishab Jolly — DZone

Evolving our real-time timeseries storage again: Built in Rust for performance at scale

Datadog has already taken their time series storage through five generations, and now they’re on the sixth. Click through to find out what motivated each change and what’s different this time around.

Khayyam Guliyev, Duarte Nunes, Ming Chen, and Justin Jaffray — Datadog

Diff Risk Score: AI-driven risk-aware software development

Meta uses a tool to automatically estimate the risk level of a code change. They’ve used this to reduce the use of code freezes.

