SRE Weekly Issue #494

SRE Weekly will be on hiatus for the next 6 weeks while I’m on medical leave.

If all goes to plan, I’ll be donating a kidney for a loved one later this week, reducing my internal redundancy to help them respond to their own internal renal incident. If you’re interested, I invite you to learn more about kidney donation. It’s fascinating!



Courtney Nash over at The VOID has launched an in-depth survey of incident management practices in tech. Please consider taking the time to fill out this survey. We all stand to benefit hugely from the information it will gather.

  Courtney Nash

Speaking of The VOID, the first bit of the September issue of the VOID Newsletter stood out to me:

Back in June, Salesforce had what appeared to be a pretty painful Heroku outage. About a month later, tech blogger Gergely Orosz posted about the incident on BlueSky. I’m bringing this up now because I’ve had over a month to chew on his commentary and I’m still mad about it. As someone who deals in reading public incident reports as a primary feature of my work, I find nothing more infuriating than people armchair quarterbacking other organizations’ incidents and presuming they actually have any idea _what really happened_.

As it happens, I also commented on the similarity between Salesforce’s incident and a past Datadog incident back in issue 482.

I’m with Courtney Nash: we really have to be careful how we opine on public incident write-ups. Not only is it important to avoid blame and hindsight bias, but we also risk disincentivizing companies from publishing write-ups at all. I highly recommend clicking through to read Courtney’s full analysis.

  Courtney Nash

This guide explains what error budgets are, how to manage them effectively, what to look out for, and how they differ from SLOs.

Includes sections on potential pitfalls, real-world examples, and impact on company culture.

  Nawaz Dhandala — OneUptime
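
As a quick back-of-the-envelope illustration of the idea (my own sketch, not code from the guide), here’s how an SLO target translates into an error budget; the 99.9% objective and 30-day window are just example numbers:

```python
# Illustrative sketch: converting an SLO target into an error budget.
# The target and window below are arbitrary example values.
SLO_TARGET = 0.999      # 99.9% availability objective
WINDOW_DAYS = 30        # length of the SLO window

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_minutes = (1 - SLO_TARGET) * window_minutes

print(f"{error_budget_minutes:.1f} minutes of allowed unavailability "
      f"per {WINDOW_DAYS}-day window")
# -> 43.2 minutes of allowed unavailability per 30-day window
```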

This article explores how backend engineers and DevOps teams can detect, debug, and prevent message loss in Kafka-based streaming pipelines using tools like OpenTelemetry, Fluent Bit, Jaeger, and dead-letter queues.

   Prakash Wagle — DZone
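
For flavor, here’s a minimal sketch of the dead-letter-queue pattern the article mentions (my own illustration, not the author’s code): catch processing failures and shunt the offending message to a separate topic rather than dropping it. The topic names and the process() placeholder are made up; this assumes the kafka-python client.

```python
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders",                          # hypothetical source topic
    bootstrap_servers="localhost:9092",
    group_id="orders-processor",
    enable_auto_commit=False,          # commit only after we've handled the message
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def process(message):
    """Placeholder for real business logic; may raise on bad input."""
    ...

for msg in consumer:
    try:
        process(msg)
    except Exception as exc:
        # Route the poison message to a dead-letter topic instead of losing it,
        # keeping enough context (source topic, error) for later debugging.
        producer.send(
            "orders.dlq",              # hypothetical dead-letter topic
            value=msg.value,
            headers=[
                ("original-topic", msg.topic.encode()),
                ("error", str(exc).encode()),
            ],
        )
    consumer.commit()                  # the offset advances either way
```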

Faced with 80 million time series, these folks found that StatsD + InfluxDB weren’t cutting it, so they switched to Prometheus.

Accessibility note: this article contains a table of text in an image with no alt text.

  Kapil

How do these folks keep producing such detailed write-ups the day after an incident?

  Tom Lianza and Joaquin Madruga — Cloudflare

The author ties a recent outage in San Francisco’s BART transit service to a couple of previous incidents by a common thread: confidence placed in a procedure that had previously been performed successfully.

This article also links to BART’s memo, which is surprisingly detailed and a great read.

  Lorin Hochstein

The folks at Graphite take us through why code search is such a hard problem and the strategies they employed to solve it.

  Brandon Willett — Graphite

A production of Tinker Tinker Tinker, LLC