SRE Weekly Issue #521

Spontaneous swarming of responders might seem like a nuisance that breaks our tidy mental models of incident response, but it’s actually very powerful. It’s something to facilitate and encourage, not simply tolerate.

  Brent Chapman

The misconception is that the local assurances automatically combine to form a single end-to-end promise that spans brokers, processors, databases, outboxes, caches, webhooks, and external APIs.

  Irullappan Irulandi — DZone

When a firmware issue caused the reboots required for firmware upgrades to take four hours(!), Cloudflare had to find a solution.

  Giovanni Pereira Zantedeschi, Nnamdi Ajah, and Omar Sheik-Omar — Cloudflare

This one strikes a balance on AI that really speaks to me.

If you’re the one left holding the bag, you should generally get final say over what goes in that bag.

  Charity Majors

How Airbnb built a Kubernetes sidecar to deliver dynamic configuration reliably at scale.

  Bo Teng — Airbnb

In this post, we’ll walk through how we redesigned our Kubernetes-based PostgreSQL clusters for failover safety, how we balanced durability against latency, and what we learned while validating this approach through benchmarking and failure testing.

  Shree Sampath — Datadog

The failure mode on this one is really interesting, and the bit about “infinite blast radius” caught my eye.

  Sarat Mahavratayajula and Vijay Sagar Gullapalli — VentureBeat

I’m enjoying this series so far, and I’m looking forward to reading the rest. It’s worth starting at part 1, but part 2 can stand on its own in a pinch.

  Uwe Friedrichsen

Updated: June 14, 2026 — 9:50 pm
A production of Tinker Tinker Tinker, LLC