SRE Weekly Issue #482

A message from our sponsor, PagerDuty:

Incidents move fast. But you’ll never get left behind with PagerDuty’s GenAI incident response assistant, available in all paid plans. Get instant business impact analysis, troubleshooting steps, and auto-drafted status updates—directly in Slack. Stop context-switching, start resolving faster.

https://fnf.dev/4dZ5V36

Salesforce posted an analysis of their major outage on June 10. An automated update restarted networking, and routing rules ended up in a bad state. This is remarkably similar to Datadog’s incident in March of 2023.

  Salesforce

In this article, the author likens LLMs to magic, in that they’re a black box in some ways. That has implications for how we go about building reliable systems around them.

  Lorin Hochstein

An executive learns a valuable lesson about the ways they can be useful during an incident — and ways they might inadvertently cause disruption.

  Hamed Silatani — Uptime Labs

This article is a summary of a new paper on how to figure out if your system is susceptible to metastable failure modes.

  Murat Demirbas

This article explores how modern teams can effectively implement, track, and leverage CFR [Change Failure Rate] to drive continuous improvement in their delivery pipelines.

  Saumen Biswas — DZone
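
For readers new to the metric, here is a minimal sketch of the arithmetic, assuming the common definition of CFR as failed deployments divided by total deployments. The `Deployment` record and its `caused_failure` flag are hypothetical names for illustration and are not taken from the linked article.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record; field names are illustrative only.
    id: str
    caused_failure: bool  # True if the change needed a rollback, hotfix, or patch


def change_failure_rate(deployments):
    """Return CFR as a fraction between 0 and 1 (0.0 when there are no deployments)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


# Made-up sample data: one failed change out of four deployments.
history = [
    Deployment("d1", False),
    Deployment("d2", True),
    Deployment("d3", False),
    Deployment("d4", False),
]
cfr = change_failure_rate(history)
print(f"CFR: {cfr:.0%}")  # prints "CFR: 25%"
```

What counts as a "failure" is the part teams most often define differently, so the flag is worth agreeing on before tracking the number over time.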

A primer on the theory and practice of circuit breakers, including example code using Resilience4j.

  Narendra Lakshmana gowda — DZone
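
If the pattern is unfamiliar, the following is a rough sketch of the closed / open / half-open cycle in plain Python. It is not Resilience4j and is not drawn from the linked article; the class name, the consecutive-failure threshold, and the injectable `clock` parameter are all assumptions made to keep the example small.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Illustrative sketch: counts consecutive failures rather than a
    # sliding-window failure rate, which real libraries typically use.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half_open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a flaky dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fail fast or fall back instead of waiting on a dependency that is already struggling.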

Airbnb introduces their internal load testing framework, Impulse, and shares details about how they perform load testing at scale.

  Chenhao Yang — Airbnb

In this first of a three-part series, HelloFresh introduces their effort to manage complexity. They start by showing what they stand to gain and then introduce high-level strategies.

  Konstantin Rohleder — HelloFresh

Updated: June 22, 2025 — 10:48 pm
A production of Tinker Tinker Tinker, LLC