SRE WEEKLY – Page 5 – scalability, availability, incident response, automation

SRE Weekly Issue #481

lex

June 15, 2025

Google Cloud Platform Incident, June 12, 2025

On Thursday, GCP had a major incident, returning 500 errors for many services worldwide. Click through for Google’s incident report.

Google

Cloudflare service outage June 12, 2025

Cloudflare’s KV service has a dependency on GCP, and Cloudflare posted this report on their incident.

Jeremy Hartman and CJ Desai — Cloudflare

Quick takes on the GCP public incident write-up

Lorin Hochstein’s perspective on an incident report often makes me see things I didn’t in my first pass.

Lorin Hochstein

Too Soon or Too Late: The Incident Escalation Dilemma

Should you escalate early or avoid pulling folks in unless absolutely necessary? This article goes into these questions and beyond, delving into the definition and purpose of escalation.

Hamed Silatani — Uptime Labs

AI Reliability Engineering: Welcome to the Third Age of SRE

How do we ensure the reliability of an LLM-based system? Can we apply traditional SRE principles and techniques to AI? This article gave me a lot to think about.

Denys Vasyliev — The New Stack

Handling Network Throttling with AWS EC2 at Pinterest

In this blog post, we’ll discuss our experiences in identifying the challenges associated with EC2 network throttling. We’ll also delve into how we developed network performance monitoring for the Pinterest EC2 fleet and discuss various techniques we implemented to manage network bursts, ensuring dependable network performance for our critical online serving workloads.

Jia Zhan and Sachin Holla — Pinterest

Beyond High Availability: Disaster Recovery Architectures That Keep Running When HA Fails

High Availability keeps things stable in small failures. DR is the safety net for large-scale disasters.

After explaining why HA by itself isn’t enough, this article covers strategies, costs, and best practices for disaster recovery.

Yakaiah Bommishetti — HackerNoon

Who the Hell is Going to Pay For This?

This article explains how observability costs can ramp up quickly, especially if we’re not careful about what data we store.

There’s a lot of nuance here, and the author posted this follow-up the next day after receiving many responses.

Leon Adato

SRE Weekly Issue #480

lex

June 8, 2025

You can’t prevent your last outage, no matter how hard you try

the idea that the highest ROI for risk reduction work is in the highest severity incidents is not a fact, it’s a hypothesis that simply isn’t supported by data.

Lorin Hochstein

Is Fewer Incidents Always Good?

Incidents are bad, so should we try to have fewer of them? This article challenges the assumptions contained within that goal and suggests other ways to frame one’s thinking.

Hamed Silatani — Uptime Labs

Understanding and optimizing resource consumption in Prometheus

This guide goes deeply into the details of how Prometheus uses memory, and then it shows you how to get a handle on it.

Vladimir Guryanov — Palark

From DNS Failures to Resilience: How NodeLocal DNSCache Saved the Day

This article discusses the DNS-related challenges encountered at Mercari on our Kubernetes clusters and the significant improvements achieved by implementing Node-Local DNS Cache.

Satyadarshi Sanu — Mercari

Paxos vs. Raft and Modern Implementations

In this post we’ll explore the fundamentals of distributed consensus, compare the dominant consensus algorithms Paxos and Raft, and examine recent implementations like Kafka Raft.

Narendra Reddy Sanikommu — DEV

Improving platform resilience at Cash App

A discussion of two techniques the folks at Cash App used to improve their reliability: adopting a two-cluster topology with Kubernetes, and using Amazon’s Fault Injection Service to simulate the failure of an availability zone.

Dustin Ellis, Deepak Garg, Ben Apprederisse, Jan Zantinge, and Rachel Sheikh — Amazon

Identifying Cacheable Queries: Using tools like pt-query-digest or the MySQL sys schema to pinpoint queries that would benefit from caching

Reading this one taught me a couple of techniques I wasn’t aware of for finding queries in need of optimization in MySQL.

Vinicius Grippa — Readyset

Incident Report: June 6th, 2025

Ouch — and a great learning opportunity for all of us:

When our backend circuit breakers triggered, aggressive websocket reconnect logic initiated on every connected client at once, further overwhelming an already stressed database.

Jake Cooper — Railway
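The synchronized-reconnect problem in that quote is commonly mitigated with randomized exponential backoff on the client. The following is a minimal sketch of that general technique, not a description of Railway's actual fix; the function name `reconnect_delay` and its parameters are hypothetical, chosen for illustration.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Return a "full jitter" backoff delay in seconds for a reconnect attempt.

    The upper bound doubles with each attempt (base * 2**attempt) up to `cap`,
    and the actual delay is drawn uniformly from [0, upper bound]. Randomizing
    the delay spreads clients out so they do not all reconnect at the same time.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Hypothetical client loop: print the delay it would sleep before each retry.
    for attempt in range(6):
        print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

With a fixed retry interval, every client disconnected at the same moment retries at the same moment. Drawing each delay at random trades a slightly longer average recovery time per client for a much smoother load on the backend.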

SRE Weekly Issue #479

lex

June 1, 2025

Automatic rollbacks are a last resort

Rollbacks don’t always return you to a previous system state. They can return you to a state you’ve never tested or operated before.

Steve Fenton — Octopus Deploy

Burn rate is a better error rate

This article explains the math of burn rate alerting and gives well-thought-out reasoning for why burn rates are better.

James Frullo — Datadog
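For readers new to the term, burn rate is usually defined as the observed error rate divided by the error budget (1 minus the SLO target). The sketch below illustrates that standard definition only; it is not taken from the linked article, and the function names are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return how many times faster than "exactly on budget" errors are occurring.

    error_rate: fraction of failed requests in the measurement window (0.01 = 1%).
    slo_target: the SLO as a fraction (0.999 = 99.9%).
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / error_budget


def hours_until_budget_exhausted(rate: float, slo_window_hours: float) -> float:
    """Return how long a full error budget lasts if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return slo_window_hours / rate


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over 30 days with 1% of requests failing.
    rate = burn_rate(error_rate=0.01, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"budget gone in about {hours_until_budget_exhausted(rate, 30 * 24):.0f} hours")
```

A burn rate of 1 means the budget would be spent exactly at the end of the SLO window. Because the number is normalized by the budget, one alert threshold can be reused across services with different SLO targets, which a raw error-rate threshold cannot do.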

Is There A Purpose In Assigning Incident Severity?

This hot take is worth thinking about: what do you want to get out of assigning incident severity levels, and is it working?

Hamed Silatani — Uptime Labs

In defence of deployment freezes

Less a defense of freezes than a guide to coping with one and avoiding the downsides when you’ve got no choice.

Tom Elliott

012: The MTTI Manifesto

MTTI in this case is Mean Time to Isolate. How long are you taking to figure out what system component is at the heart of an incident? What does MTTI say about your system, and what can you do about it?

Old School Burke

Is AI-assisted coding an incident magnet?

This article doesn’t answer the question in its title concretely, but it does give one a lot to think about. It also shares some ideas for how to cope with the potential challenges identified.

Sylvain Kalache — LeadDev

Not causal chains, but interactions and adaptations

This one starts off as a review of a workbook on root cause analysis by the UK Health and Safety Executive. Then it raises concerns about RCA-based reasoning and contrasts it with a different model based on resilience engineering.

Lorin Hochstein

On Azure’s new SRE Agent

I wrote this article in response to Azure’s post, Introducing Azure SRE Agent. There’s a lot we can learn from the example agent interactions that Microsoft chose to share.

Lex Neva

SRE Weekly Issue #478

lex

May 25, 2025

Security and SRE: How Datadog’s combined approach aims to tackle security and reliability challenges

Datadog has fully merged their SRE and Security teams.

In this post, we’ll look at essential elements of SRE and security, the benefits we’ve realized by combining the two disciplines, and what that approach looks like for us.

Bianca Lankford — Datadog

What I Really Mean When I Say “Good Communication” in Incident Response

I love the way this article describes three different audiences for your communication during incidents. It describes what each audience is looking for and gives both positive and negative examples of how to communicate with them.

Hamed Silatani — Uptime Labs

Load testing: Prepare for the growth you dream of!

My favorite part of this article is the section on where to run your load tests: production, staging, or something else?

Tom Elliott

Working on Complex Systems: What I Learned Working at Google

What is complexity? This article gives a clear definition and breaks down the qualities one can find in a complex system. Then it goes over various methods of dealing with that complexity.

Teiva Harsanyi — The Coder Cafe

QUIC restarts, slow problems: udpgrm to the rescue

Cloudflare has a history of doing some pretty interesting things with sockets in Linux — and taking us along for the journey with highly detailed explanations. This article is no exception, sharing the unique challenges encountered when restarting processes that handle UDP streams.

Marek Majkowski

Do not deploy on Friday!

This article examines the standard Friday deploy prohibition and ultimately pushes back.

Ok… but why not?

Adrien Guéret — OpenClassrooms

Google SREs are changing the game again: a breakdown of their new approach

This article introduces the STAMP (System-Theoretic Accident Model and Processes) framework being adopted at Google, after first explaining the shortcomings in traditional SRE practices that prompted Google to adopt STAMP.

Jorge Lainfiesta — Rootly

Labeling a root cause is predicting the future, poorly

I really love this framing of what’s wrong with picking a single root cause.

Lorin Hochstein

SRE Weekly Issue #477

lex

May 18, 2025

Human Error Strikes Again… or Does It?

Why don’t we look for the root cause of a successful outcome?

Hamed Silatani — Uptime Labs

How we optimized LLM use for cost, quality, and safety to facilitate writing postmortems

They took a great deal of care to avoid the potential pitfalls of using an LLM in this way, and they share a lot of detail about the steps they took.

Tran Le, Till Pieper, and Gillian McGarvey — Datadog

When incident heroics are too heroic: the “bigger problems” limit

After dealing with a late-night outage with surprisingly small impact, I got thinking about how you would know if you were working too hard to guarantee uptime.

Tom Elliott

The 4 R’s of Pipeline Reliability: Data Systems That Last

In this article, learn how the 4 R’s — robust architecture, resumability, recoverability, and redundancy — enhance reliability in AI and ML data pipelines.

Sidhant Bendre — DZone

Upgrading ECK Operator: A Side-by-Side Kubernetes Operator Upgrade Approach

In this article, I’ll delve into the challenges we encountered and the strategies we employed to manage operator upgrades for stateful workloads like Elasticsearch. Additionally, I’ll detail how we modified the ECK [Elastic Cloud on Kubernetes] operator to facilitate a more resilient side-by-side upgrade process.

Abhishek Munagekar — Mercari

Observability 2025: Navigating Costs, Complexity, and The Rise of AI

In this piece, I’ll delve into four macro challenges facing observability today, explore strategies that are emerging across the industry to address them, and offer my perspective on the trajectory of this crucial domain in the year to come.

Andrew Mallaband

Zero-Touch Bare Metal at Scale

A deep-dive into a pretty nifty system for enumerating and provisioning a rack of servers, complete with PXE-based Debian headless installation using an auto-generated preseed file. It also uses Claude to figure out what state a server is in from a screenshot obtained from the BMC.

Charith Amarasinghe — Railway

What is Koreo?

Koreo is a new open source tool for orchestrating Kubernetes infrastructure at a higher level than standard tools like Helm.

Koreo is a fairly complex tool, so it can be difficult to quickly grasp just what exactly it is, what problems it’s designed to solve, and how it compares to other, similar tools. In this post, I want to dive into these topics and also discuss the original motivation behind Koreo.

Tyler Treat

On work processes and outcomes

This one is about understanding how work actually happens in our sociotechnical systems (versus how we imagine it). This has implications for how we learn from incidents and how we design corrective actions.

Lorin Hochstein
