SRE Weekly Issue #523

A message from our sponsor, Buildkite:

More places to run, more scale to manage and maintain, usually means more blind spots; not here. Buildkite’s control plane holds the live state of every job, agent and queue, regardless of throughput size.

See what’s running, what’s waiting and why with immediate insight → https://buildkite.com/platform/pipelines/

This week, I passed on a couple of articles for the same reason: they contained images with significant text content and no alt text. I don’t always entirely skip such articles, but in this case, the content was relevant enough that I didn’t want to leave folks with screen readers behind.

I have sight, but missing alt text does cause me to stumble even still. I read the vast majority of articles for the newsletter via text-to-speech. It can be really jarring and confusing when I miss an important thread of an article because it’s in an image. I can stop and take a look, but this can be a great forcing function to remember that others may not be able to.

While I’m here, a quick addendum to last week’s issue: I failed to attribute the AWS article to its author, Harshvardhan Chunawala. Sorry, Harshvardhan!

Oh, I’ve definitely felt that pull to debug as an IC. Gotta either hand over the IC reins or, as this article recommends, find a good tech lead.

  Brent Chapman

If your three data types can’t be joined programmatically today, an AI layer on top won’t fix that; it’ll just be confused faster.

  Pruthvi Raj Seknametla — HackerNoon

In this article, we’ve compiled a selection of tips we wish we had known the first time we picked up the pager or bore the BlackBerry.

  Uptime Labs

Me too. I do so much of my learning from an incident while I’m trying to write about it.

  Lorin Hochstein

The level of candor in this one is commendable. By all rights the maintenance itself went well — the incident was in the communication leading up to it.

  Fred Hebert — Honeycomb

This deep debugging story has a satisfying ending, and I can really feel the level of effort and detective work it took to get there.

  Deanna Lam, Diretnan Domnan, and Matt Lewis

How we made custom instrumentation blazing fast, simple, and data-centric

The answer was not just to throw AI at it.

  Jean-Mark Wright

We were curious whether AI could help us safely evolve a critical production system. This post is about what worked, what didn’t, and what we learned along the way.

I like their approach: AI is only a tool, powerful but not the whole solution.

  Arnold Wakim — Datadog

SRE Weekly Issue #522

A message from our sponsor, Bronto:

What would an AI SRE choose for their observability stack?

We asked AWS DevOps Agent to run a live test comparing Bronto, Grafana Loki, and Elasticsearch against the same OpenTelemetry dataset.

Bronto scored highest (9.4/10) and was the only tool that didn’t return silent failures. Curious why?

See the full results 🦕

[…] the fix isn’t “train your engineers to write better status updates.” The fix is to stop asking your engineers to write them, and start asking the right people instead.

  Brent Chapman

A satisfying scaling story where every fix came from looking more closely at the system — Kafka head-of-line blocking, a clumpy scheduler, and an active-active API that silently doubled latency for half of all partitions.

  Dave Baxter — Cloudflare

Some good examples of risks in here, along with an interesting tendency to blame “user error”.

  Prakshal Doshi — HackerNoon

Satellites present unique reliability constraints like limited data uplink windows and the risk of bricking a very expensive piece of equipment.

Author:

This looks fun! It’s a free virtual event on July 8.

  Uptime Labs

This article does a really great job of building up an explanation of feedback-based control and the difference between edge-triggered and level-triggered systems.

  Fatih Arslan — PlanetScale

An open letter to software researchers to study incident response in software systems. It’s so cool how the author translates incident response concepts to researchers who may not be familiar, with examples.

  Lorin Hochstein

An important concept: a user’s perception of your average outage duration is weighted and won’t match a flat average MTTR.

  Marc Brooker
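
To make the weighting concrete, here is a minimal Python sketch. It assumes the weighting is by duration (a user who hits an outage at a random moment is more likely to land in a long one), which is one reading of the idea rather than something spelled out above. The outage durations and function names are made up for illustration.

```python
def flat_mean(durations):
    """Plain average: every outage counts once, regardless of length."""
    return sum(durations) / len(durations)


def duration_weighted_mean(durations):
    """Average outage length as seen by a user who arrives at a random
    moment during downtime: each outage is weighted by its own duration."""
    return sum(d * d for d in durations) / sum(durations)


# Hypothetical month: four 1-minute blips and one 56-minute outage.
outages_minutes = [1, 1, 1, 1, 56]

flat = flat_mean(outages_minutes)                   # 12.0 minutes
weighted = duration_weighted_mean(outages_minutes)  # about 52.3 minutes
```

Under this assumption, the flat MTTR says 12 minutes, while most of the downtime a user could encounter belongs to the single long outage.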

SRE Weekly Issue #521

A message from our sponsor, Bronto:

Stuck with slow queries and scattered logs?

What if you could easily retain all of your telemetry data in one place for a full year without sky-high bills?

Now with Bronto, it’s possible. Connect the dots faster across TBs of always hot, full fidelity data.

Try Bronto today 🦕

Spontaneous swarming of responders might seem like a nuisance that breaks our tidy mental models of incident response, but it’s actually very powerful. It’s something to facilitate and encourage, not simply tolerate.

  Brent Chapman

The misconception is that the local assurances automatically combine to form a single end-to-end promise that spans brokers, processors, databases, outboxes, caches, webhooks, and external APIs.

  Irullappan irulandi — DZone

When a firmware issue caused reboots for firmware upgrades to take four hours(!), they had to find a solution.

  Giovanni Pereira Zantedeschi, Nnamdi Ajah, and Omar Sheik-Omar — Cloudflare

This one strikes a balance on AI that really speaks to me.

If you’re the one left holding the bag, you should generally get final say over what goes in that bag.

  Charity Majors

How Airbnb built a Kubernetes sidecar to deliver dynamic configuration reliably at scale.

  Bo Teng — Airbnb

In this post, we’ll walk through how we redesigned our Kubernetes-based PostgreSQL clusters for failover safety, how we balanced durability against latency, and what we learned while validating this approach through benchmarking and failure testing.

  Shree Sampath — Datadog

The failure mode on this one is really interesting, and the bit about “infinite blast radius” caught my eye.

  Sarat Mahavratayajula and Vijay Sagar Gullapalli — VentureBeat

I’m enjoying this series so far, and I’m looking forward to reading the rest. It’s worth starting at part 1, but part 2 can stand on its own in a pinch.

  Uwe Friedrichsen

SRE Weekly Issue #520

A message from our sponsor, BigPanda:

Your team solved this incident last month. Why is it back?

Because you fixed the symptom, not the cause. BigPanda surfaces the pattern behind repeat incidents and tells you what to fix so the next on-call doesn’t fight the same P1.

Prevent incidents proactively

We build our systems against the usage patterns of human users, but agents fundamentally change the game.

  Vineet Bhatkoti — DZone

This is an interesting lens for exploring the risks that agents can introduce.

  Sayali Patil — VentureBeat

Great discussion in the comments! There’s a lot of variance in how much time people recommend. I personally tend to lean earlier — on-call is a great way to learn, and I can always reach out if I get stuck.

  u/modern_medicine_isnt and commenters — Reddit r/sre

A great intro to the concept of metastable failures, and I recommend reading the original paper as well.

  Teiva Harsanyi

The real issue is that your company has made declaring an incident costly and risky for the person who does it.

  Brent Chapman

I enjoyed learning about their deliberate architectural choice to keep their central service in a single AZ. This incident highlighted a need for a fast failover plan.

  Coinbase

I like the balance between ensuring 99.99% reliability and designing their product to encourage customers to use their platform in a way that effectively manages the 0.01% case.

Reliability is a customer experience problem

  Mike Fisher — incident.io

I’m not gonna spoil this one for you by writing a summary. Just read it, trust me.

  Lorin Hochstein

SRE Weekly Issue #519

A message from our sponsor, BigPanda:

What if you could predict which changes will cause incidents?

BigPanda analyzes every change, including ones marked safe, to surface the real risk and impact before deployment. Next time, routine changes don’t become your next P1.

See BigPanda for SREs

They give solid examples to argue that much of the learning happens during the process of writing a post-incident review.

[…] you could throw the post-incident review document away after writing it and still get the vast majority of the value out of the process.

  Brent Chapman

I really like this idea of change absorption capacity.

  Priya Gopalsamy — Stack Overflow

A useful guide that covers strategies for benchmarking, along with pitfalls to avoid.

  Ben Dicken — PlanetScale

Serverless isn’t inherently cheaper. Hidden costs add up, and at scale it’s often pricier than containers — best for sporadic, not steady workloads.

  David Iyanu Jonathan — DZone

With just under 4.5 minutes of leeway for outages per month, you have to rely on automated remediation. AI can help, but it’s not a full solution, per this article.

  Norberto Lopes — incident.io
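
The "just under 4.5 minutes" figure is consistent with a 99.99% availability target over a 31-day month. The target isn't named above, so treat 99.99% as an assumption. The function name is hypothetical, and this is only a sketch of the arithmetic.

```python
def monthly_error_budget_minutes(slo: float, days_in_month: int) -> float:
    """Minutes of allowed downtime in a month for a given availability target."""
    total_minutes = days_in_month * 24 * 60
    return (1.0 - slo) * total_minutes


# Assumed target: 99.99% availability.
budget_30 = monthly_error_budget_minutes(0.9999, 30)  # about 4.32 minutes
budget_31 = monthly_error_budget_minutes(0.9999, 31)  # about 4.46 minutes
```

With a budget that small, a single page-and-respond cycle can consume most of the month's allowance.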

LLMs are specifically designed to generate plausible-seeming output, and this makes reviewing especially difficult.

  Diomidis Spinellis

A breakdown of the 28-hour AWS us-east-1 outage in May 2026. What caused it, what went down, and what it means for how you design your infrastructure.

  Alon Shrestha

This article has a list of common problems in incident response, and I feel like printing it and taping it to my wall.

  Karan Nagarajagowda — Uptime Labs

A production of Tinker Tinker Tinker, LLC