SRE Weekly Issue #505

A message from our sponsor, Hopp:

Paging at 2am? 🚨 Make incident triage feel like you’re at the same keyboard with Hopp.

  • crisp, readable screen-sharing
  • no more “can you zoom in?”
  • click + type together
  • bring the incident bridge into one session

Start pair programming: https://www.gethopp.app/?via=sreweekly

An incident write-up from the archives, and it’s a juicy one. An update to their code caused a crash only after some time had passed, so their automated testing didn’t catch it before they deployed it worldwide.

  Xandr

This article covers an independent review of the Optus outage.

I personally find it astounding that somebody conducting an incident investigation would not delve deeper into how a decision that appears to be astounding would have made sense in the moment.

  Lorin Hochstein

Cloudflare needed a tool to look for overlapping impact across their many maintenance events in order to avoid unintentionally impairing redundancy.

  Kevin Deems and Michael Hoffmann — Cloudflare

Another great piece on expiration dates. I especially like the discussion of abrupt cliffs as a design choice.

  Chris Siebenmann — University of Toronto

It’s not always easy to see how to automate a given bit of toil, especially when cross-team interactions are involved.

  Thomas A. Limoncelli and Christian Pearce — ACM Queue

How do resilience and fault tolerance relate? Are they synonyms, do they overlap, or does one contain the other?

  Uwe Friedrichsen

After unexpectedly losing their observability vendor, these folks were able to migrate to a new solution within a couple of days.

  Karan Abrol, Yating Zhou, Pratyush Verma, Aditya Bhandari, and Sameer Agarwal — Deductive.ai

A great dive into what blameless incident analysis really means.

Blameless also doesn’t mean you stop talking about what people did.

  Busra Koken

Updated: January 11, 2026 — 8:42 pm
A production of Tinker Tinker Tinker, LLC