SRE Weekly Issue #496

The hidden trade-offs of fine-grained progressive rollouts

Progressive rollouts may seem like a great strategy to reduce risk, but this article explains some hidden difficulties. For example, a slow rollout can obscure a problem or make it more difficult to detect.

Lorin Hochstein

Go and enhance your calm: demolishing an HTTP/2 interop problem

A fun HTTP/2 debugging journey, complete with a somewhat ridiculous solution: don’t forget to read the zero-length response body.

Lucas Pardue and Zak Cutner — Cloudflare

Stop Reactive Network Troubleshooting: Monitor These 5 Metrics to Prevent Downtime

I know that title sounds like a listicle, but I can tell that this list of canary metrics came from hard-won experience.

Sascha Neumeier — DZone

From Signals to Reliability: SLOs, Runbooks and Post-Mortems

This post focuses on the human systems that turn observability into reliability. You’ll see how to define SLOs that drive decisions, build runbooks that scale team knowledge, structure post-mortems that generate improvements and embed these practices into engineering culture without adding bureaucracy.

Fatih Koç

SRE math every engineer should know: a practical guide

You don’t have to be a mathematician, but understanding a few key concepts is critical for an SRE.

Srivatsa RV — One2N

LLMs Broke the SRE Runbook. Now What?

Outputs are non-deterministic, data pipelines shift underfoot, and key components behave like black boxes. As a result, many of the tools and rituals SREs have mastered for decades no longer map cleanly to production AI.

This is a summary of a panel discussion from SREcon EMEA 2025 on how SREs can adapt to LLMs.

Sylvain Kalache — The New Stack

Trixter: A Chaos Proxy for Simulating Network Faults

This nifty tool lets you inject all sorts of faults into a TCP stream and see what happens. It’s in userland, so it’s much easier to use than Linux’s traffic shaper.

Viacheslav Biriukov

Your Brain on Incidents

This one starts with an on-call horror story, but fortunately it also has useful tips for improving on-call health.

Stuart Rimell — Uptime Labs
