SRE Weekly Issue #512

Ashby taught us we have to fight fire with fire

Improving robustness requires increasing complexity. Let’s throw more complexity at it?

I’m using this enormously complex system, an LLM, to help me solve a problem that was created by software complexity in the first place.

Lorin Hochstein

Multi-Agent System Reliability

This feels like using multiple agents as a sort of redundancy and cross-validation architecture to improve the reliability of agent output.

Alex Ewerlöf

Why End-to-End Testing Fails in Microservice Architectures

This article explains why end-to-end testing breaks down in microservice-based systems, not due to poor tooling, but because of fundamental architectural and operational mismatches.

Alok Kumar — DZone

AI-generated code ships fast, but runtime control hasn’t kept up

LaunchDarkly’s survey data have some interesting things to say about the impact of AI.

[…] while build and deployment velocity have improved, production reliability has not.

LaunchDarkly

The Picture They Paint of You

Fred Hebert surveyed how AI coding assistants vs. AI SRE tools are marketed and found a stark divide: coding assistants are framed as partners that augment engineers, while AI SREs are framed as replacements for low-value work. The implication is that the people building and buying these tools see incident response as grunt work to be automated away — and that says a lot about how decision-makers perceive the role.

Fred Hebert

5 Incident Response Principles for CTOs

I especially like the point that incidents are leadership moments — how you respond tells your team everything about the culture you’re building. This one is aimed at CTOs, but really it’s a great reminder for anyone in a leadership role during incidents.

Joe Mckevitt — Uptime Labs

Why Retries Are More Dangerous Than Failures

There’s a really interesting bit in this one about libraries and layers of the system doing their own retries without your knowledge, magnifying retry volume.

David Iyanu Jonathan — DZone
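To make the retry magnification point concrete, here is a rough sketch of the worst-case arithmetic. It assumes every layer retries independently and the bottom dependency fails every time; the function name and the three-layer stack are hypothetical, not taken from the article.

```python
def worst_case_attempts(attempts_per_layer):
    """Worst-case number of calls that reach the bottom dependency
    for a single top-level request, given the maximum attempts
    (initial try plus retries) made by each layer above it."""
    total = 1
    for attempts in attempts_per_layer:
        # Each attempt at this layer triggers a full round of
        # attempts in the layer below, so the counts multiply.
        total *= attempts
    return total


# Hypothetical stack: application code, an SDK, and an HTTP client,
# each quietly configured for up to 3 attempts.
layers = [3, 3, 3]
print(worst_case_attempts(layers))  # 27
```

Three layers that each look harmless on their own can turn one failing request into 27 calls against a dependency that is already struggling.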

The post-mortem problem

I like the section on what AI should and shouldn’t do. It’s important to avoid automating away the process of learning from incidents.

incident.io
