Progressive rollouts may seem like a great strategy to reduce risk, but this article explains some hidden difficulties. For example, a slow rollout can obscure a problem or make it more difficult to detect.
Lorin Hochstein
A fun HTTP/2 debugging journey, complete with a somewhat ridiculous solution: don’t forget to read the zero-length response body.
Lucas Pardue and Zak Cutner — Cloudflare
I know that title sounds like a listicle, but I can tell that this list of canary metrics came from hard-won experience.
Sascha Neumeier — DZone
This post focuses on the human systems that turn observability into reliability. You’ll see how to define SLOs that drive decisions, build runbooks that scale team knowledge, structure post-mortems that generate improvements, and embed these practices into engineering culture without adding bureaucracy.
Fatih Koç
You don’t have to be a mathematician, but understanding a few key concepts is critical for an SRE.
Srivatsa RV — One2N
Outputs are non-deterministic, data pipelines shift underfoot, and key components behave like black boxes. As a result, many of the tools and rituals SREs have mastered for decades no longer map cleanly to production AI.
This is a summary of a panel discussion from SREcon EMEA 2025 on how SREs can adapt to LLMs.
Sylvain Kalache — The New Stack
This nifty tool lets you inject all sorts of faults into a TCP stream and see what happens. It’s in userland, so it’s much easier to use than Linux’s traffic shaper.
Viacheslav Biriukov
This one starts with an on-call horror story, but fortunately it also has useful tips for improving on-call health.
Stuart Rimell — Uptime Labs
