SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #485

lex

July 13, 2025

Migrating the Jira Database Platform to AWS Aurora

How would you migrate several million databases, with minimal impact to your users?

Atlassian allocates one Postgres database per tenant customer, with a few thousand colocated on each RDS instance. This migration story was a riveting read!

Pat Rubis — Atlassian

“What went well” is more than just a pat on the back

Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong.

Lorin Hochstein

How we built reliable log delivery to thousands of unpredictable endpoints

My favorite part of this article was the explanation of how they handle pent-up logs when a customer’s endpoint recovers, without overwhelming the endpoint.

Gabriel Reid — Datadog

Surprise Surprise: When Reality Doesn’t Read the Runbook

How do you deal with fundamental surprise? This article introduces the concept of surprise², an incident you couldn’t see coming. Click through for some strategies to handle the inevitable occasional fundamentally surprising incident.

Stuart Rimell — Uptime Labs

How We Broke the Monolith (and Kept Our Sanity): Lessons From Moving to Microservices

A team found themselves needing to switch to microservices, and they chronicled their approach and results. I really like the section on the surprises they encountered.

Shushyam Malige Sharanappa — DZone

Seventh-generation server hardware at Dropbox: our most efficient and capable architecture yet

Dropbox shares what went into the rollout of their new fleet, including careful management of heat, vibration, and power.

Eric Shobe and Jared Mednick — Dropbox

Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure

In this blog post, we’ll dive into the details of three mighty alerts that play their unique role in supporting our production infrastructure, and explore how they’ve helped us maintain the high level of performance and uptime that our community relies on.

…plus one bonus alert!

Jeremy Udit — Hugging Face

Our Experience with Amazon Aurora Blue/Green Deployments

Klaviyo adopted RDS’s blue/green deployment feature to make MySQL version upgrades much less painful. In this article they share their path to blue/green deployment and their results.

Marc Dellavolpe — Klaviyo

SRE Weekly Issue #484

lex

July 6, 2025

Exact Code Search: Find code faster across repositories

This is really neat! They’ve developed a new approach to search that uses 3-letter “trigrams” rather than tokenizing words, making it especially well-suited to code search. It converts regular expressions to trigram searches behind the scenes.

Dmitry Gruzd — GitLab
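
To make the trigram idea concrete, here is a minimal sketch in Python. It is not GitLab's implementation: the `TrigramIndex` class and its method names are hypothetical, it handles only literal substring queries, and the regular-expression-to-trigram conversion mentioned above is left out. The index narrows the search to documents containing every trigram of the query, then confirms each candidate with a real substring check, since sharing all trigrams does not guarantee a match.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return every 3-character substring of text (empty if text is shorter than 3)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical in-memory index mapping each trigram to the documents containing it."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, name: str, content: str) -> None:
        self.docs[name] = content
        for gram in trigrams(content):
            self.postings[gram].add(name)

    def search(self, literal: str) -> list[str]:
        grams = trigrams(literal)
        if not grams:
            # Queries shorter than 3 characters cannot use the index; scan everything.
            candidates = set(self.docs)
        else:
            # Only documents containing every trigram of the query can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Candidates may be false positives, so verify with a real substring check.
        return sorted(n for n in candidates if literal in self.docs[n])


index = TrigramIndex()
index.add("a.py", "def parse_config(path):")
index.add("b.py", "def parse_args(argv):")
print(index.search("parse_config"))  # ['a.py']
print(index.search("parse_"))        # ['a.py', 'b.py']
```

Because the index works on raw character triples rather than word tokens, punctuation and partial identifiers such as `parse_` are searchable, which is the property that suits code.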

Pattern machines that we don’t understand

This article about LLMs is by a regularly featured author here in the newsletter. It’s not, strictly speaking, directly SRE-related, but I really got a lot out of it, so I’m including it anyway.

Lorin Hochstein

Soft vs. Hard Dependency: A Better Way to Think About Dependencies for More Reliable Systems

This one explains the difference between a soft and hard dependency, why it matters, and how to use this information to improve reliability. I like the section on soft dependencies evolving into hard dependencies when you’re not looking.

Teiva Harsanyi — The Coder Cafe

Breaking up a monolith: How we’re unwinding a shared database at scale

In this post, we’ll walk through how we’re splitting apart our shared database into independently owned instances. We’ll explain how we defined the right boundaries, minimized risk during migrations, and built the tooling to make the process safe and scalable.

Fabiana Scala and Tali Gutman — Datadog

Big Enough to Fail

At some point, the external dependencies which our systems rely on become so tightly coupled, large, and fundamental that, should those foundations inevitably fail, blame can actually go down in response to an incident.

This thought-provoking article explores why we’re more tolerant of outages from large tech companies like Google Cloud or Salesforce, and what this means for how we think about reliability engineering and incident response.

Will Gallego

Use AWS FIS to test the resilience of self-managed Cassandra

This practical guide shows how to use AWS Fault Injection Service (FIS) to perform chaos engineering experiments on self-managed Cassandra clusters. It walks through setting up experiments to test node failure scenarios and validate that applications can properly handle database outages through connection pooling and retry mechanisms.

Hans Nesbitt and Lwanga Phillip — AWS

Building a Billing Usage Recovery System

Klaviyo shares how they built an automated recovery system to handle billing usage tracking failures. The system uses S3 for data storage and SQS for message queuing to ensure that missed usage events are automatically recovered, eliminating manual intervention and reducing customer confusion.

Kaavya Antony — Klaviyo

Taming Complexity: HelloFresh’s Playbook for Managing Large-Scale Change (Part 3/3)

Final stretch! We’ve handled people and processes, now let’s crack the code side and stitch everything together into a four-stage framework you can reuse.

SRE Weekly Issue #483

lex

June 29, 2025

The same incident never happens twice, but the patterns recur over and over

If you focus too narrowly on preventing the specific details of the last incident, you’ll fail to identify the more general patterns that will enable your future incidents.

Lorin Hochstein

Resilience vs. Robustness: Cultivating Resilience in Incident Response

An interesting thought: scaffolding our software systems to make them more robust might actually hamper our sociotechnical system’s overall resilience. I love the horticultural analogy.

Stuart Rimell — Uptime Labs

SRE2.0: No LLM Metrics, No Future: Why SRE Must Grasp LLM Evaluation Now

As LLM services become more prevalent, traditional infrastructure metrics like availability and latency are no longer sufficient on their own to measure reliability. What should we use instead?

T-sato — Mercari

Breaking to Build Better: Platform Engineering With Chaos Experiments

Here’s a primer on chaos testing in Kubernetes, including a tutorial on using CNCF’s LitmusChaos tool to perform chaos experiments in your cluster. It’s more than just a tutorial, because it covers theoretical topics like chaos testing anti-patterns.

Josephine Eskaline Joyce — DZone

How we scaled fast, reliable configuration distribution to thousands of workload containers

The problem space seems simple, but the theme here is scale: simple solutions just don’t work in an infrastructure the size of Datadog’s.

Gabriel Reid — Datadog

Taming Complexity: HelloFresh’s Playbook for Managing Large-Scale Change (Part 2/3)

This second installment focuses on operational complexity and strategic decision-making for large-scale initiatives. The article covers when to use formal programs versus working groups, how to leverage prioritization to reduce operational burden, and strategies for phased rollouts that balance technical complexity with agility.

Konstantin Rohleder — HelloFresh

“Best practices” aren’t always best for you

This article challenges the assumption that popular DevOps practices are universally beneficial, arguing that teams should evaluate whether practices like Kubernetes, SLOs, or GitOps actually solve their specific problems rather than adopting them because “everyone else does.”

Tom Elliott — The Friday Deploy

SLA vs SLO

This short post covers:

* Why does this distinction matter?
* An illustration to build a memorable base
* Quotes from Google’s books

Alex Ewerlöf

SRE Weekly Issue #482

lex

June 22, 2025

Service Disruption on multiple Salesforce services on June 10-11, 2025

Salesforce posted an analysis of their major outage on June 10. An automated update restarted networking, and routing rules ended up in a bad state. This is remarkably similar to Datadog’s incident in March of 2023.

Salesforce

LLMs are weird, man

In this article, the author likens LLMs to magic, in that they’re a black box in some ways. That has implications for how we go about building reliable systems around them.

Lorin Hochstein

When Uptime Met Downtime: My Journey from Engineer to Executive (A Retrospective Commentary)

An executive learns a valuable lesson about the ways they can be useful during an incident — and ways they might inadvertently cause disruption.

Hamed Silatani — Uptime Labs

Analyzing Metastable Failures in Distributed Systems

This article is a summary of a new paper on how to figure out if your system is susceptible to metastable failure modes.

Murat Demirbas

Engineering Resilience Through Data: A Comprehensive Approach to Change Failure Rate Monitoring

This article explores how modern teams can effectively implement, track, and leverage CFR [Change Failure Rate] to drive continuous improvement in their delivery pipelines.

Saumen Biswas — DZone
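
For readers new to the metric, Change Failure Rate is commonly defined as the share of production changes that lead to a failure needing remediation (a rollback, hotfix, or incident). The sketch below shows the arithmetic only; the `Deployment` record and its `caused_failure` field are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical record of one production change."""
    id: str
    caused_failure: bool  # True if it needed a rollback, hotfix, or incident response


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused a failure (0.0 when there are none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


history = [
    Deployment("d1", False),
    Deployment("d2", True),
    Deployment("d3", False),
    Deployment("d4", False),
]
print(f"CFR: {change_failure_rate(history):.0%}")  # CFR: 25%
```

In practice, most of the work is deciding what counts as a failure and attributing it to a change, which this sketch assumes has already been done.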

Understanding the Circuit Breaker: A Key Design Pattern for Resilient Systems

A primer on the theory and practice of circuit breakers, including example code using Resilience4j.

Narendra Lakshmana gowda — DZone
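
The article's examples use Resilience4j in Java. As a language-neutral illustration of the pattern itself, here is a minimal Python sketch. It is not Resilience4j's API: the class, parameter names, and defaults are assumptions chosen for illustration, and it is not thread-safe. The breaker opens after a run of consecutive failures, rejects calls while open, and lets a trial call through once the reset timeout has elapsed (the half-open state).

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as a signal to fall back or fail fast instead of waiting on a dependency that is known to be unhealthy.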

Load Testing with Impulse at Airbnb

Airbnb introduces their internal load testing framework, Impulse, and shares details about how they perform load testing at scale.

Chenhao Yang — Airbnb

Taming Complexity: HelloFresh’s Playbook for Managing Large-Scale Programs (Part 1/3)

In this first of a three-part series, HelloFresh introduces their effort to manage complexity. They start by showing what they stand to gain and then introduce high-level strategies.

Konstantin Rohleder — HelloFresh

SRE Weekly Issue #481

lex

June 15, 2025

Google Cloud Platform Incident, June 12, 2025

On Thursday, GCP had a major incident, returning 500 errors for many services worldwide. Click through for Google’s incident report.

Google

Cloudflare service outage June 12, 2025

Cloudflare’s KV service has a dependency on GCP, and Cloudflare posted this report on their incident.

Jeremy Hartman and CJ Desai — Cloudflare

Quick takes on the GCP public incident write-up

Lorin Hochstein’s perspective on an incident report often makes me see things I didn’t in my first pass.

Lorin Hochstein

Too Soon or Too Late: The Incident Escalation Dilemma

Should you escalate early or avoid pulling folks in unless absolutely necessary? This article goes into these questions and beyond, delving into the definition and purpose of escalation.

Hamed Silatani — Uptime Labs

AI Reliability Engineering: Welcome to the Third Age of SRE

How do we ensure the reliability of an LLM-based system? Can we apply traditional SRE principles and techniques to AI? This article gave me a lot to think about.

Denys Vasyliev — The New Stack

Handling Network Throttling with AWS EC2 at Pinterest

In this blog post, we’ll discuss our experiences in identifying the challenges associated with EC2 network throttling. We’ll also delve into how we developed network performance monitoring for the Pinterest EC2 fleet and discuss various techniques we implemented to manage network bursts, ensuring dependable network performance for our critical online serving workloads.

Jia Zhan and Sachin Holla — Pinterest

Beyond High Availability: Disaster Recovery Architectures That Keep Running When HA Fails

High Availability keeps things stable in small failures. DR is the safety net for large-scale disasters.

After explaining why HA by itself isn’t enough, this article covers strategies, costs, and best practices for disaster recovery.

Yakaiah Bommishetti — HackerNoon

Who the Hell is Going to Pay For This?

This article explains how observability costs can ramp up quickly, especially if we’re not careful about what data we store.

There’s a lot of nuance here, and the author posted this followup the next day after receiving many responses.

Leon Adato
