Salesforce posted an analysis of its major outage on June 10. An automated update restarted networking, and routing rules ended up in a bad state. This is remarkably similar to Datadog’s incident in March of 2023.
Salesforce
In this article, the author likens LLMs to magic, in that they’re a black box in some ways. That has implications for how we go about building reliable systems around them.
Lorin Hochstein
An executive learns a valuable lesson about the ways they can be useful during an incident — and ways they might inadvertently cause disruption.
Hamed Silatani — Uptime Labs
This article is a summary of a new paper on how to figure out if your system is susceptible to metastable failure modes.
Murat Demirbas
This article explores how modern teams can effectively implement, track, and leverage CFR [Change Failure Rate] to drive continuous improvement in their delivery pipelines.
Saumen Biswas — DZone
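For readers new to the metric, CFR is commonly computed as the share of deployments that led to a failure in production. Here is a minimal sketch of that arithmetic, not taken from the linked article; the `caused_failure` field and the record shape are hypothetical.

```python
def change_failure_rate(deployments):
    """Return failed deployments / total deployments, or None if there were none.

    Each deployment is a dict with a hypothetical boolean "caused_failure" key.
    """
    total = len(deployments)
    if total == 0:
        return None  # the rate is undefined when nothing was deployed
    failed = sum(1 for d in deployments if d["caused_failure"])
    return failed / total


# Illustrative data only: four deployments, one of which needed remediation.
history = [
    {"id": "d1", "caused_failure": False},
    {"id": "d2", "caused_failure": True},
    {"id": "d3", "caused_failure": False},
    {"id": "d4", "caused_failure": False},
]
print(change_failure_rate(history))  # 0.25
```

What counts as a "failure" (rollback, hotfix, incident) is a team decision, so the definition should be fixed before comparing numbers over time.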
A primer on the theory and practice of circuit breakers, including example code using Resilience4j.
Narendra Lakshmana gowda — DZone
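As a rough illustration of the pattern, here is a minimal circuit breaker sketch in plain Python. It is not the Resilience4j API and not the article's code; the class name, thresholds, and state names are assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half_open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of sending more load to a struggling dependency.
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call in half_open reopens the circuit immediately.
            if state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count and closes the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical function, and handle `CircuitOpenError` with a fallback. Production libraries typically add sliding windows, failure-rate thresholds, and limits on concurrent trial calls.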
Airbnb introduces their internal load testing framework, Impulse, and shares details about how they perform load testing at scale.
Chenhao Yang — Airbnb
In this first of a three-part series, HelloFresh introduces their effort to manage complexity. They start by showing what they stand to gain and then introduce high-level strategies.
Konstantin Rohleder — HelloFresh