SRE Weekly Issue #494

SRE Weekly will be on hiatus for the next 6 weeks while I’m on medical leave.

If all goes to plan, I’ll be donating a kidney for a loved one later this week, reducing my internal redundancy to help them respond to their own internal renal incident. If you’re interested, I invite you to learn more about kidney donation. It’s fascinating!



Courtney Nash over at The VOID has launched an in-depth survey of incident management practices in tech. Please consider taking the time to fill out this survey. We all stand to benefit hugely from the information it will gather.

  Courtney Nash

Speaking of The VOID, the first bit of the September issue of the VOID Newsletter stood out to me:

Back in June, Salesforce had what appeared to be a pretty painful Heroku outage. About a month later, tech blogger Gergely Orosz posted about the incident on BlueSky. I’m bringing this up now because I’ve had over a month to chew on his commentary and I’m still mad about it. As someone who deals in reading public incident reports as a primary feature of my work, I find nothing more infuriating than people armchair quarterbacking other organizations’ incidents and presuming they actually have any idea what really happened.

As it happens, I also commented in issue 482 on the similarity of Salesforce’s incident to a past Datadog incident.

I’m with Courtney Nash: we really have to be careful how we opine on public incident write-ups. Not only is it important to avoid blame and hindsight bias, but we also need to be careful not to disincentivize companies from posting public incident write-ups. I highly recommend clicking through to read Courtney’s full analysis.

  Courtney Nash

This guide explains what error budgets are, how to manage them effectively, what to look out for, and how they differ from SLOs.

Includes sections on potential pitfalls, real-world examples, and impact on company culture.

  Nawaz Dhandala — OneUptime
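
As a quick, hypothetical illustration of how an error budget falls out of an SLO (my own sketch, not taken from the guide): a 99.9% availability target over a 30-day window leaves about 43 minutes of allowed downtime.

    # Hypothetical numbers, not from the linked guide: convert an SLO target
    # into an error budget for a given window.
    slo_target = 0.999                     # 99.9% availability SLO
    window_minutes = 30 * 24 * 60          # 30-day rolling window
    error_budget_minutes = (1 - slo_target) * window_minutes
    print(f"{error_budget_minutes:.1f} minutes of allowed downtime")  # ~43.2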

This article explores how backend engineers and DevOps teams can detect, debug, and prevent message loss in Kafka-based streaming pipelines using tools like OpenTelemetry, Fluent Bit, Jaeger, and dead-letter queues.

  Prakash Wagle — DZone
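
As background, the dead-letter queue pattern mentioned above usually looks roughly like the sketch below (my own illustration using the kafka-python client; the broker address and topic names are hypothetical): messages that fail processing are republished to a separate topic rather than silently dropped.

    # Rough sketch of a dead-letter queue in a Kafka consumer (illustrative only).
    from kafka import KafkaConsumer, KafkaProducer

    def process(payload: bytes) -> None:
        ...  # application-specific handling that may raise on bad input

    consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                             group_id="order-processor")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for message in consumer:
        try:
            process(message.value)
        except Exception:
            # Park the failed message on a dead-letter topic so it can be
            # inspected and replayed later, instead of being lost.
            producer.send("orders.dlq", message.value)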

Faced with 80 million time series, these folks found that StatsD + InfluxDB weren’t cutting it, so they switched to Prometheus.

Accessibility note: this article contains a table of text in an image with no alt text.

  Kapil

How do these folks keep producing such detailed write-ups the day after an incident?

  Tom Lianza and Joaquin Madruga — Cloudflare

The author ties a recent outage in San Francisco’s BART transit service to a couple of previous incidents by a common thread: confidence placed in a procedure that had previously been performed successfully.

This article also links to BART’s memo, which is surprisingly detailed and a great read.

  Lorin Hochstein

The folks at Graphite take us through why code search is so difficult and the strategies they employed to tackle it.

  Brandon Willett — Graphite

SRE Weekly Issue #493

A message from our sponsor, Shipfox:

Shipfox supercharges GitHub Actions – no workflow changes, 30-min setup.

  • 2x faster builds with better CPU, faster disks & high-throughput caching
  • 75% lower costs with shorter jobs and better price-per-performance
  • Full CI observability with test/job speed and reliability

👉 See how it works: https://www.shipfox.io?utm_source=SREWeekly&utm_campaign=issue493

I like how this goes deep on the ways proxies can manage many connections at once, like SO_REUSEPORT.

  Mitendra Mahto
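
If SO_REUSEPORT is new to you, the gist (my own minimal sketch, not code from the article) is that several processes can each bind the same address and port, and the Linux kernel spreads incoming connections across their listening sockets:

    # Minimal SO_REUSEPORT illustration (Linux). Run this in several worker
    # processes: each binds port 8080, and the kernel load-balances new
    # connections across the listening sockets.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", 8080))
    sock.listen(128)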

Here I want to talk about two classes of problems for which accountability is a poor solution; the OceanGate accident falls into the second class.

  Lorin Hochstein

This one has so many lessons we can learn from that it might as well be about IT infrastructure.

Research from high-reliability organizations reveals that individual errors are almost always symptoms of deeper systemic problems.

  Muhammad Abdullah Khan — KevinMD.com

But Vibe Coding introduces real risks, particularly around resilience, that are worth examining before we place too much faith in it.

I really appreciate the way the author methodically lays out their points, including through the concept of competitive and complementary artifacts.

  Stuart Rimell — Uptime Labs

I really enjoyed the part about the Google interview question. That one’s going to have me thinking for a while.

  Jos Visser

Here’s a great overview of why time and ordering are so important (and difficult) in distributed systems.

  Sid — The Scalable Thread

I really like the way the author teases apart the true, practical meaning of “eventual consistency”. The example of Amazon shopping carts is especially illuminating.

  Uwe Friedrichsen

SRE Weekly Issue #492

A message from our sponsor, Observe, Inc.:

Built on a scalable, cost-efficient data lake, Observe delivers AI-powered observability at scale. With its context-aware Knowledge Graph and AI SRE, Observe enables Capital One, Topgolf, and Dialpad to ingest hundreds of terabytes daily and resolve issues faster—at drastically lower cost.

Learn how Observe is redefining observability for the AI era.

Three days ago, PagerDuty had a major incident, severely impacting incident creation, notifications, and more. Linked above is a discussion on Reddit’s r/sre with lots of takes on how folks deal with this kind of thing.

  u/Secret-Menu-2121 and others

It’s not telepathy; it’s about building common ground. This article explains what that means and the components that comprise common ground in an incident.

  Stuart Rimell — Uptime Labs

An introduction to database connection pooling in general, and RDS Proxy in particular, complete with a Terraform snippet.

  David Kraytsberg — Klaviyo
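
The article’s snippet is Terraform for RDS Proxy; for comparison, plain client-side pooling (my own sketch, not from the article, with a made-up connection string) looks roughly like this with SQLAlchemy:

    # Illustrative client-side connection pool with SQLAlchemy; this is not
    # the article's RDS Proxy / Terraform setup, and the DSN is hypothetical.
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql+psycopg2://app:secret@db.example.internal:5432/app",
        pool_size=5,         # connections kept open in the pool
        max_overflow=10,     # extra connections allowed under burst load
        pool_pre_ping=True,  # validate a connection before handing it out
    )

    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))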

This article explores the difference between simple and easy, their relation to complexity, and the effect of production pressure.

  Lorin Hochstein

What does “High Availability” actually mean? It turns out that it can mean different things to different people, and it’s important to look deeper.

  Teiva Harsanyi — The Coder Cafe

This short but sweet untitled LinkedIn post goes into the importance of understanding the entire context rather than focusing on an individual’s mistakes or omissions.

  Ron Gantt

Whether you’re just getting started implementing SLIs and SLOs or you’re a veteran, you’ll want to read this one. It charts the progress of organizations as they successively refine and mature their SLIs, and more importantly, it explains why the later stages matter.

  Alex Ewerlöf

SRE Weekly Issue #491

A message from our sponsor, Spacelift:

Infrastructure Security Virtual Event – This Wednesday, August 27
Join the IaCConf community on August 27 for a free virtual event that dives into IaC security best practices and real-world stories. Hear from three speakers on:

  • Taking a Platform Approach to Safer Infrastructure
  • How Tagged, Vetted Modules Can Transform IaC Security Posture
  • Securing IaC Provisioning Pipelines with PR Automation Best Practices

Register for the event, join the community, and level up your IaC practices!

Register for free

This 2-part episode of The VOID Podcast is just awesome, and well worth a listen. The conversation is framed as a retrospective of a simulated incident, with a high level of expertise and experience among the incident participants and the retrospective facilitator. I have a lot to think about, especially the discussion of overload and the four ways people react to it.

  Courtney Nash — The VOID Podcast, with guests Sarah Butt, Eric Dobbs, Alex Elman, and Hamed Silatani

Discover how tail sampling in OpenTelemetry enhances observability, reduces costs, and captures critical traces for faster detection and smarter system monitoring.

  Rishab Jolly — DZone

Datadog’s time series storage has already been through five generations, and now they’re on the sixth. Click through to find out what motivated each change and what’s different this time around.

  Khayyam Guliyev, Duarte Nunes, Ming Chen, and Justin Jaffray — Datadog

Meta uses a tool to automatically estimate the risk level of a code change. They’ve used it to reduce their reliance on code freezes.

  Meta

The authors of Catchpoint’s SRE Report look back at their analysis and predictions related to AIOps, compared to how things are unfolding now.

  Leo Vasiliou and Denton Chikura — The New Stack

I love the approach and the level of detail in this article. They gave four LLMs access to observability data in a simulated infrastructure and asked them to troubleshoot a problem. It’s super useful to see the actual results from the LLMs.

  Lionel Palacin and Al Brown — ClickHouse

Uptime Labs goes meta by sharing the details of an incident they experienced last month, involving runaway creation of dynamic queues in RabbitMQ.

  Joe Mckevitt — Uptime Labs

I’m pretty impressed: Cloudflare published this article with a ton of detail on an incident, the day after it happened. A surge of traffic overloaded Cloudflare’s data center interconnect links to AWS’s us-east-1 region.

  David Tuber, Emily Music, and Bryton Herdes — Cloudflare

SRE Weekly Issue #490

A message from our sponsor, Observe, Inc.:

Built on a scalable, cost-efficient data lake, Observe delivers AI-powered observability at scale. With its context-aware Knowledge Graph and AI SRE, Observe enables Capital One, Topgolf, and Dialpad to ingest hundreds of terabytes daily and resolve issues faster—at drastically lower cost.

Learn how Observe is redefining observability for the AI era.

Catchpoint’s yearly survey is live! This time, they’ll plant a tree for each of the first 2000 respondents.

  Catchpoint

If you’re looking to build a status page, this article is for you. It reviews 10 status pages and closes with a list of things to consider as you design yours.

  Sara Miteva — Checkly

The GCP outage on June 12 hit Cloudflare hard, and they’ve responded by redesigning their Workers KV service to eliminate the dependency on a third-party cloud.

  Alex Robinson and Tyson Trautmann — Cloudflare

I found the bit about Google’s historical reasons for SRE especially interesting.

  Dave O’Connor

There’s a fascinating point in this article explaining why “eventual consistency” may sound entirely different to German speakers. It continues on to a really good explanation of what eventual consistency actually means.

  Uwe Friedrichsen

This article introduces SLI Compass, a 2D mental model to help you:

  • Quickly assess the signal/noise ratio of existing SLIs
  • Evaluate SLIs based on their cost and complexity
  • Set a direction for improving the quality of existing SLIs at a reasonable ROI

  Alex Ewerlöf

This is a really interesting failure mode for an endpoint monitoring provider.

  Tomas Koprusak — UptimeRobot

A production of Tinker Tinker Tinker, LLC