This 2-part episode of The VOID Podcast is just awesome, and well worth a listen. The conversation is framed as a retrospective of a simulated incident, with a high level of expertise and experience among the incident participants and the retrospective facilitator. I have a lot to think about, especially the discussion of overload and the four ways people react to it.
Courtney Nash — The VOID Podcast, with guests Sarah Butt, Eric Dobbs, Alex Elman, and Hamed Silatani
Discover how tail sampling in OpenTelemetry enhances observability, reduces costs, and captures critical traces for faster detection and smarter system monitoring.
Rishab Jolly — DZone
Datadog has evolved their time series storage through five generations, and now they’re on the sixth. Click through to find out what motivated each change and what’s different this time around.
Khayyam Guliyev, Duarte Nunes, Ming Chen, and Justin Jaffray — Datadog
Meta uses a tool to automatically estimate the risk level of a code change. They’ve used this to reduce the use of code freezes.
Meta
The authors of Catchpoint’s SRE Report look back at their analysis and predictions related to AIOps, compared to how things are unfolding now.
Leo Vasiliou and Denton Chikura — The New Stack
I love the approach and the level of detail in this article. They gave four LLMs access to observability data in a simulated infrastructure and asked them to troubleshoot a problem. It’s super useful to see the actual results from the LLMs.
Lionel Palacin and Al Brown — ClickHouse
Uptime Labs goes meta by sharing the details of an incident they experienced last month, involving runaway creation of dynamic queues in RabbitMQ.
Joe Mckevitt — Uptime Labs
I’m pretty impressed: Cloudflare published this article with a ton of detail on an incident, the day after it happened. A surge of traffic overloaded Cloudflare’s data center interconnect links to AWS’s us-east-1 region.
David Tuber, Emily Music, Bryton Herdes — Cloudflare