SRE WEEKLY – scalability, availability, incident response, automation

SRE Weekly Issue #492

lex

August 31, 2025

r/sre: pagerduty went down and my day went straight to hell

Three days ago, PagerDuty had a major incident, severely impacting incident creation, notifications, and more. Linked above is a discussion on reddit’s r/sre with lots of takes on how folks deal with this kind of thing.

u/Secret-Menu-2121 and others

Being on the Same Page During an Incident: Not Actually Telepathy

It’s not telepathy; it’s about building common ground. This article explains what that means and the components that comprise common ground in an incident.

Stuart Rimell — Uptime Labs

Pooling Connections with RDS Proxy at Klaviyo

An introduction to database connection pooling in general, and RDS proxy in specific, complete with a Terraform snippet.

David Kraytsberg — Klaviyo

Easy will always trump simple

This article explores the difference between simple and easy, their relation to complexity, and the effect of production pressure.

Lorin Hochstein

Availability Models: Because “Highly Available” Isn’t Saying Much

What does “High Availability” actually mean? It turns out that it can mean different things to different people, and it’s important to look deeper.

Teiva Harsanyi — The Coder Cafe

Ron Gantt on incidents and blame

This short but sweet untitled LinkedIn post goes into the importance of understanding the entire context rather than focusing on an individual’s mistakes or omissions.

Ron Gantt

SLI Evolution Stages

Whether you’re just getting started implementing SLIs and SLOs or you’re a veteran, you’ll want to read this one. It charts the progress of organizations as they successively refine and mature their SLIs, and more importantly, it explains why the later stages matter.

Alex Ewerlöf

SRE Weekly Issue #491

lex

August 24, 2025

General

Comments

View on sreweekly.com

Uptime Labs and the Multi-Party Dilemma (Part I)

This 2-part episode of The VOID Podcast is just awesome, and well worth a listen. The conversation is framed as a retrospective of a simulated incident, with a high level of expertise and experience in the incident participants and the retrospective facilitator. I have a lot to think about, especially the discussion of overload and the four ways people react to it.

Courtney Nash — The VOID Podcast, with guests Sarah Butt, Eric Dobbs, Alex Elman, and Hamed Silatani

Tail Sampling: The Future of Intelligent Observability in Distributed Systems

Discover how tail sampling in OpenTelemetry enhances observability, reduces costs, and captures critical traces for faster detection and smarter system monitoring.

Rishab Jolly — DZone

Evolving our real-time timeseries storage again: Built in Rust for performance at scale

Datadog has evolved their time series storage through five generations before, and now they’re on the sixth. Click through to find out what motivated each change and what’s different this time around.

Khayyam Guliyev, Duarte Nunes, Ming Chen, and Justin Jaffray — Datadog

Diff Risk Score: AI-driven risk-aware software development

Meta uses a tool to automatically estimate the risk level of a code change. They’ve used this to reduce the use of code freezes.

SRE Weekly Issue #490

lex

August 17, 2025

General

Comments

View on sreweekly.com

SRE Survey 2025

Catchpoint’s yearly survey is live! This time, they’ll plant a tree for each of the first 2000 respondents.

Catchpoint

Top 10 Status Page Examples: What We Like and What’s Missing

If you’re looking to build a status page, this article is for you. It gives reviews of 10 status pages and sums it up with a list of things to consider as you design yours.

Sara Miteva — Checkly

Redesigning Workers KV for increased availability and faster performance

The GCP outage on June 12 hit Cloudflare hard, and they’ve responded by redesigning their Workers KV service to eliminate the dependency on a third party cloud.

Alex Robinson and Tyson Trautmann — Cloudflare

6 Reasons You Don’t Need an SRE Team

I found the bit about Google’s historical reasons for SRE especially interesting.

Dave O’Connor

A note about eventual consistency – Part 1

There’s a fascinating point in this article explaining why “eventual consistency” may sound entirely different to German speakers. It continues on to a really good explanation of what eventual consistency actually means.

Uwe Friedrichsen

SLI Compass: Fidelity and Granularity

This article introduces SLI Compass, a 2D mental model to help you:

Quickly assess the signal/noise ratio of existing SLIs

Evaluate SLIs based on their cost and complexity

Set a direction for improving the quality of existing SLIs at a reasonable ROI

Alex Ewerlöf

False positives in EU monitoring region: what happened and what we’re doing about it.

This is a really interesting failure mode for an endpoint monitoring provider.

Tomas Koprusak — UptimeRobot

SRE Weekly Issue #489

lex

August 10, 2025

General

Comments

View on sreweekly.com

Negotiating the Paradox We Face in Resilience Engineering—Lessons From an Engineering Leader

As we learn advanced resilience engineering concepts, this article recommends that we take a balanced approach in how we try to change existing practices.

I can confidently say that when an executive leader wants to be talking about quality of service for your customers, the last thing they want to hear about is academic papers and Monte Carlo simulations.

Michelle Casey — Resilience in Software Foundation

Hashing

I know you probably know all about how hashing works, but this one’s still worth a read. The article includes interactive demonstrations and clearly presents concepts to help you understand how hashing function performance is evaluated.

Sam Rose

How We Migrated the Parse API From Ruby to Golang (Resurrected)

Pulled from the Internet Archive, here’s a story of how the now-defunct Parse rewrote their Ruby on Rails API in Golang, significantly improving reliability.

Charity Majors

How Meta keeps its AI hardware reliable

We are sharing methodologies we deploy at various scales for detecting SDC [Silent Data Corruption] across our AI and non-AI infrastructure to help ensure the reliability of AI training and inference workloads across Meta.

Harish Dattatraya Dixit and Sriram Sankar — Meta

Guarding the herd – managing database servers at scale

As monday.com broke their monolith up into microservices, their number of databases expanded too. To have a chance of managing all of them, they shifted from DBA practices to DBRE.

Mateusz Wojciechowski — monday.com

Achieving High Availability with distributed database on Kubernetes at Airbnb

Airbnb runs a large-scale database on Kubernetes. They have various techniques to deal with the ephemerality of pods and the risks inherent in cluster upgrades.

Artem Danilov — Airbnb

K8sGPT for Kubernetes troubleshooting: How AI helps in different cases

The author of this article brings us along as they do a very thorough evaluation of K8sGPT, showing us what it can do and some ways in which it can fall short.

Evgeny Torin — Palark

Climbing the Communication Ladder During Incidents

What is good incident communication? This article draws on theory from Herbert Clark’s Joint Action Ladder to help us evaluate and strengthen communication.

Stuart Rimell — Uptime Labs

SRE Weekly Issue #488

lex

August 3, 2025

General

Comments

View on sreweekly.com

The Wild Story of the Taum Sauk Dam Failure

A story of the failure of a pumped energy storage facility, involving all of our favorite features like complex contributing factors, work-as-done vs work-as-designed, and early warning signs only obvious in hindsight. As a bonus, no one was killed.

Practical Engineering

Jet Lag: The Traffic

Nebula‘s streaming service has a surprisingly write-heavy workflow owing to storing bookmarks of the latest point a given user has watched in a video. That makes scaling an interesting challenge.

Sam Rose — Nebula

Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes

I love the debugging technique they used: kill processes one at a time until performance improves.

Samson Hu, Shashank Tavildar, Eric Kalkanger, and Hunter Gatewood — Pinterest

When Process Becomes Latency: Optimizing Incident Response Cadence

This article is about finding the balance between having enough process to ensure incident response goes smoothly, and having so much process that incident responders are unable to adapt to unexpected situations.

Brandon Chalk — Rootly

What Experts See That the Rest of Us Miss During Incidents

This article presents two case studies of dialog during incidents along with analysis of each. How does your own analysis compare?

Hamed Silatani — Uptime Labs

Chris’s Wiki :: blog/sysadmin/MachineRoomTempTwoSortsOfAlerts

They realized that a single alert can’t catch both a sudden AC failure and an AC that becomes slowly but steadily overwhelmed.

Chris Siebenmann

Cloudflare and the infinite sadness of migrations

Thoughts on migrations as a significant source of reliability risk.

[…] engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores.

Lorin Hochstein

Google Cloud Platform incident report: July 18, 2025

An incorrect physical disconnection was made to the active network switch serving our control plane, rather than the redundant unit scheduled for removal.

This reminds me of wrong-side surgery incidents and aircraft pilots shutting off the good engine when one fails.

Google

SRE Weekly Issue #492

SRE Weekly Issue #491

SRE Weekly Issue #490

SRE Weekly Issue #489

SRE Weekly Issue #488

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, Observe, Inc.:

A message from our sponsor, Spacelift:

A message from our sponsor, Observe, Inc.:

A message from our sponsor, Observe, Inc.:

A message from our sponsor, Observe, Inc.:

Subscribe

RSS

Mastodon

Search Issues