SRE Weekly Issue #522

A message from our sponsor, Bronto:

What would an AI SRE choose for their observability stack?

We asked AWS DevOps Agent to run a live test comparing Bronto, Grafana Loki, and Elasticsearch against the same OpenTelemetry dataset.

Bronto scored highest (9.4/10) and was the only tool that didn’t return silent failures. Curious why?

See the full results 🦕

[…] the fix isn’t “train your engineers to write better status updates.” The fix is to stop asking your engineers to write them, and start asking the right people instead.

  Brent Chapman

A satisfying scaling story where every fix came from looking more closely at the system — Kafka head-of-line blocking, a clumpy scheduler, and an active-active API that silently doubled latency for half of all partitions.

  Dave Baxter — Cloudflare

Some good examples of risks in here, along with an interesting tendency to blame “user error”.

  Prakshal Doshi — HackerNoon

Satellites present unique reliability constraints like limited data uplink windows and the risk of bricking a very expensive piece of equipment.

Author:

This looks fun! It’s a free virtual event on July 8.

  Uptime Labs

This article does a really great job of building up an explanation of feedback-based control and the difference between edge-triggered and level-triggered systems.

  Fatih Arslan — PlanetScale
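
For readers new to the distinction, here is a minimal sketch of it. It is not taken from the article: the function names and the replica-count scenario are hypothetical, chosen only to show why a dropped event hurts an edge-triggered system while a level-triggered loop recovers.

```python
def edge_triggered(events, start=0):
    """React to each change event (a delta) as it arrives.

    A dropped event is never recovered, because nothing
    re-checks the actual state afterwards.
    """
    replicas = start
    for delta in events:
        replicas += delta
    return replicas


def reconcile(desired, observed, max_steps=10):
    """Level-triggered feedback loop (hypothetical helper).

    Each pass compares the desired state with the observed state
    and applies the difference, so it converges no matter which
    events were missed along the way.
    """
    for _ in range(max_steps):
        correction = desired - observed
        if correction == 0:
            break
        observed += correction
    return observed


# Three "+1 replica" events were sent, but one was dropped in transit.
delivered = [1, 1]
after_edges = edge_triggered(delivered)

# The level-triggered loop only needs the current and desired state.
after_reconcile = reconcile(desired=3, observed=after_edges)
```

In this toy case the edge-triggered path stays one replica short, and the reconcile loop closes the gap on its next pass.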

An open letter urging software researchers to study incident response in software systems. It’s so cool how the author uses examples to explain incident response concepts to researchers who may not be familiar with them.

  Lorin Hochstein

An important concept: a user’s perception of your average outage duration is weighted and won’t match a flat average MTTR.

  Marc Brooker
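
A small worked example of the gap, under one simplifying assumption that is mine and not necessarily the article's: a user is affected by an outage in proportion to how long it lasts, so each outage is weighted by its own duration. The outage durations below are made up for illustration.

```python
def flat_mean(durations):
    """Plain average outage duration, the usual MTTR-style figure."""
    return sum(durations) / len(durations)


def duration_weighted_mean(durations):
    """Average duration of the outage a randomly timed user lands in.

    Assumes the chance of being caught in an outage is proportional
    to its length, so each duration is weighted by itself.
    """
    return sum(d * d for d in durations) / sum(durations)


# Hypothetical data: four short outages and one long one, in minutes.
outages_minutes = [5, 5, 5, 5, 60]

flat = flat_mean(outages_minutes)                   # 16.0
weighted = duration_weighted_mean(outages_minutes)  # 46.25
```

With these numbers the flat average is 16 minutes, but the outage a typical affected user sits through averages about 46 minutes, because the single long outage accounts for most of the downtime users actually see.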

Updated: June 21, 2026 — 10:28 pm
A production of Tinker Tinker Tinker, LLC