SRE Weekly Issue #482

Service Disruption on multiple Salesforce services on June 10-11, 2025

Salesforce posted an analysis of their major outage on June 10. An automated update restarted networking, and routing rules ended up in a bad state. This is remarkably similar to Datadog’s incident in March 2023.

Salesforce

LLMs are weird, man

In this article, the author likens LLMs to magic, in that they’re a black box in some ways. That has implications for how we go about building reliable systems around them.

Lorin Hochstein

When Uptime Met Downtime: My Journey from Engineer to Executive (A Retrospective Commentary)

An executive learns a valuable lesson about the ways they can be useful during an incident — and ways they might inadvertently cause disruption.

Hamed Silatani — Uptime Labs

Analyzing Metastable Failures in Distributed Systems

This article is a summary of a new paper on how to figure out if your system is susceptible to metastable failure modes.

Murat Demirbas

Engineering Resilience Through Data: A Comprehensive Approach to Change Failure Rate Monitoring

This article explores how modern teams can effectively implement, track, and leverage CFR [Change Failure Rate] to drive continuous improvement in their delivery pipelines.

Saumen Biswas — DZone
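
For readers new to the metric, CFR is commonly defined as the share of deployments that lead to a failure in production. Here is a minimal sketch of that arithmetic in Python. The `Deployment` record and its `caused_failure` flag are hypothetical names chosen for illustration, not taken from the article, and what counts as a "failure" is something each team has to define for itself.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    id: str
    caused_failure: bool  # assumption: set by whatever process links incidents to changes


def change_failure_rate(deployments):
    """Return failed deployments divided by total deployments (0.0 if there are none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


week = [
    Deployment("d1", False),
    Deployment("d2", True),
    Deployment("d3", False),
    Deployment("d4", False),
]
print(f"CFR: {change_failure_rate(week):.0%}")
```

Returning 0.0 for an empty window is a design choice; a dashboard might prefer to show "no data" instead.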

Understanding the Circuit Breaker: A Key Design Pattern for Resilient Systems

A primer on the theory and practice of circuit breakers, including example code using Resilience4j.

Narendra Lakshmana gowda — DZone
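
The article's own examples use Resilience4j in Java. For a language-neutral feel of the pattern, here is a hand-rolled sketch in Python, not Resilience4j's API. The class name, parameters, and thresholds are all assumptions made for illustration: the breaker opens after a number of consecutive failures, rejects calls while open, and allows a trial call once a timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half_open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap a flaky dependency as `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function, and treat `CircuitOpenError` as a signal to fall back or fail fast. This sketch is not thread-safe and counts only consecutive failures; production libraries typically use sliding windows and failure-rate thresholds.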

Load Testing with Impulse at Airbnb

Airbnb introduces their internal load testing framework, Impulse, and shares details about how they perform load testing at scale.

Chenhao Yang — Airbnb

Taming Complexity: HelloFresh’s Playbook for Managing Large-Scale Programs (Part 1/3)

In this first of a three-part series, HelloFresh introduces their effort to manage complexity. They start by showing what they stand to gain and then introduce high-level strategies.

Konstantin Rohleder — HelloFresh
