SRE Weekly Issue #508

SRE Weekly will be going on hiatus for 6 weeks, while I’m on leave caring for my partner after her kidney transplant surgery this week. It’s incredible that the National Kidney Registry’s Paired Exchange program allowed me to donate a kidney to help her even though we don’t have matching blood types!

A message from our sponsor, Costory:

Tired of manually explaining your cloud & LLM bills?
Check our live preview to see how Costory links every cost spike to deployments, infra changes, and usage patterns, and delivers a clean summary straight to Slack.

Explore the demo

What do we miss when we have LLMs write our code for us? This article explains that one thing we can miss out on is building a mental model.

  Shayon Mukherjee

I really love this explanation of the concept of compensation.

Compensation is a very interesting mechanism in software systems because it can keep complex systems alive, but also because it can be a factor in how they quickly and unexpectedly collapse.

  Fred Hebert — Resilience in Software Foundation

When you investigate an incident and tell the story about what you found, but no one believes you because there’s no smoking gun or bad actor…

  Lorin Hochstein

To build and maintain reliable systems, organizations must align responsibility with control. This is where the Ownership Trio (Mandate, Knowledge, and Accountability) comes in.

  Spiros Economakis

I love when an article goes through the designs they passed over (and why) before reaching their final design, as in this one.

  Julianne Walker — Tines

If you’re unfamiliar with Docker image lazy loading like I was, this is a great primer on two options, Estargz and SOCI.

  Huong Vuong and Joseph Sahayaraj — Grab

But don’t let MTTR become the thing you’re optimising for. The goal is to build systems and processes where you’re constantly learning and improving, not systems where you’re just really efficient at fighting the same fires over and over.

  Dave O’Connor

I watched a supposedly “resilient” Multi-Region setup completely implode recently. The architecture diagram looked great – active workloads in US-East, cold standby in US-West. But when the provider had a global IAM service degradation, the whole thing became a brick.

  u/NTCTech on Reddit

Updated: February 1, 2026 — 11:01 pm
A production of Tinker Tinker Tinker, LLC