SRE Weekly Issue #522

Incident status updates are a translation problem, and the right translator probably isn’t in Engineering

[…] the fix isn’t “train your engineers to write better status updates.” The fix is to stop asking your engineers to write them, and start asking the right people instead.

Brent Chapman

Scaling Security Insights: how we achieved a 10x increase in global scanning capacity

A satisfying scaling story where every fix came from looking more closely at the system — Kafka head-of-line blocking, a clumpy scheduler, and an active-active API that silently doubled latency for half of all partitions.

Dave Baxter — Cloudflare

Vibe-Coded Infra Is Your New Reliability Hazard

Some good examples of risks in here, along with an interesting tendency to blame “user error”.

Prakshal Doshi — HackerNoon

An incident response playbook for satellite operations on AWS (Part-1): Detection and forensic readiness

Satellites present unique reliability constraints like limited data uplink windows and the risk of bricking a very expensive piece of equipment.

Author:

Incident Fest ’26

This looks fun! It’s a free virtual event on July 8.

Uptime Labs

The feedback loops behind Kubernetes

This article does a really great job of building up an explanation of feedback-based control and the difference between edge-triggered and level-triggered systems.

Fatih Arslan — PlanetScale
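
For readers who want the edge-triggered vs. level-triggered distinction in miniature, here is a minimal sketch. The class names, the dropped-event scenario, and the replica numbers are hypothetical illustrations, not taken from the article or from any Kubernetes API. It assumes an edge-triggered handler only reacts to change events, while a level-triggered loop compares desired state against observed state.

```python
class EdgeTriggered:
    """Hypothetical handler that only reacts to change events (deltas)."""

    def __init__(self):
        self.running = 0

    def on_event(self, delta):
        self.running += delta


class LevelTriggered:
    """Hypothetical reconciler that compares desired state to observed state."""

    def __init__(self):
        self.running = 0

    def reconcile(self, desired):
        diff = desired - self.running
        self.running += diff  # converge on the desired level
        return diff


# Desired replica count goes 0 -> 1 -> 3 -> 2, expressed as deltas.
events = [1, 2, -1]
delivered = [True, False, True]  # assumption: the middle event is lost
final_desired = 2

edge = EdgeTriggered()
for delta, ok in zip(events, delivered):
    if ok:
        edge.on_event(delta)

level = LevelTriggered()
level.reconcile(final_desired)

print(edge.running, level.running)
```

In this toy run the edge-triggered handler drifts away from the desired count after one lost event, while the level-triggered loop converges regardless of which events it missed.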

Dear researchers column

An open letter asking software researchers to study incident response in software systems. It’s so cool how the author translates incident response concepts, with examples, for researchers who may not be familiar with them.

Lorin Hochstein

Meet Alice. Alice is impatient.

An important concept: a user’s perception of your average outage duration is weighted and won’t match a flat average MTTR.

Marc Brooker
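
One way to see why a weighted average differs from a flat one is the rough sketch below. It assumes users arrive at uniformly random moments during downtime, so a long outage is experienced by proportionally more users. That assumption, the function names, and the sample durations are illustrative and not taken from the article.

```python
def flat_mean(durations):
    """Plain average of outage durations, as a flat MTTR would report."""
    return sum(durations) / len(durations)


def time_weighted_mean(durations):
    """Average outage length as seen from a random moment of downtime.

    Each outage is weighted by its own duration, because a user who
    arrives at a random down moment is more likely to land in a long outage.
    """
    return sum(d * d for d in durations) / sum(durations)


# Hypothetical data: four 5-minute outages and one 60-minute outage.
outages_min = [5, 5, 5, 5, 60]

flat = flat_mean(outages_min)
weighted = time_weighted_mean(outages_min)

print(f"flat mean: {flat} min, time-weighted mean: {weighted} min")
```

With these made-up numbers the flat mean is 16 minutes, while the time-weighted mean is 46.25 minutes, since the single long outage accounts for most of the downtime anyone actually experiences.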

