SRE Weekly Issue #491

A message from our sponsor, Spacelift:

Infrastructure Security Virtual Event – This Wednesday, August 27
Join the IaCConf community on August 27 for a free virtual event that dives into IaC security best practices and real-world stories. Hear from three speakers on:

  • Taking a Platform Approach to Safer Infrastructure
  • How Tagged, Vetted Modules Can Transform IaC Security Posture
  • Securing IaC Provisioning Pipelines with PR Automation Best Practices

Register for the event, join the community, and level up your IaC practices!

Register for free

This 2-part episode of The VOID Podcast is just awesome, and well worth a listen. The conversation is framed as a retrospective of a simulated incident, with a high level of expertise and experience in the incident participants and the retrospective facilitator. I have a lot to think about, especially the discussion of overload and the four ways people react to it.

  Courtney Nash — The VOID Podcast, with guests Sarah Butt, Eric Dobbs, Alex Elman, and Hamed Silatani

Discover how tail sampling in OpenTelemetry enhances observability, reduces costs, and captures critical traces for faster detection and smarter system monitoring.

  Rishab Jolly — DZone

Datadog’s time series storage has already gone through five generations, and now they’re on the sixth. Click through to find out what motivated each change and what’s different this time around.

  Khayyam Guliyev, Duarte Nunes, Ming Chen, and Justin Jaffray — Datadog

Meta uses a tool to automatically estimate the risk level of a code change. They’ve used this to reduce the use of code freezes.

  Meta

The authors of Catchpoint’s SRE Report look back at their analysis and predictions related to AIOps, compared to how things are unfolding now.

  Leo Vasiliou and Denton Chikura — The New Stack

I love the approach and the level of detail in this article. They gave four LLMs access to observability data in a simulated infrastructure and asked them to troubleshoot a problem. It’s super useful to see the actual results from the LLMs.

  Lionel Palacin and Al Brown — ClickHouse

Uptime Labs goes meta by sharing the details of an incident they experienced last month, involving runaway creation of dynamic queues in RabbitMQ.

  Joe Mckevitt — Uptime Labs

I’m pretty impressed: Cloudflare published this article with a ton of detail on an incident, the day after it happened. A surge of traffic overloaded Cloudflare’s data center interconnect links to AWS’s us-east-1 region.

  David Tuber, Emily Music, Bryton Herdes — Cloudflare

Updated: August 24, 2025 — 9:42 pm
A production of Tinker Tinker Tinker, LLC