SRE WEEKLY – Page 3 – scalability, availability, incident response, automation

SRE Weekly Issue #478

lex

May 25, 2025

Security and SRE: How Datadog’s combined approach aims to tackle security and reliability challenges

Datadog has fully merged their SRE and Security teams.

In this post, we’ll look at essential elements of SRE and security, the benefits we’ve realized by combining the two disciplines, and what that approach looks like for us.

Bianca Lankford — Datadog

What I Really Mean When I Say “Good Communication” in Incident Response

I love the way this article describes three different audiences for your communication during incidents. It describes what each audience is looking for and gives both positive and negative examples of how to communicate with them.

Hamed Silatani — Uptime Labs

Load testing: Prepare for the growth you dream of!

My favorite part of this article is the section on where to run your load tests: production, staging, or something else?

Tom Elliot

Working on Complex Systems: What I Learned Working at Google

What is complexity? This article gives a clear definition and breaks down the qualities one can find in a complex system. Then it goes over various methods of dealing with that complexity.

Teiva Harsanyi — The Coder Cafe

QUIC restarts, slow problems: udpgrm to the rescue

Cloudflare has a history of doing some pretty interesting things with sockets in Linux — and taking us along for the journey with highly-detailed explanations. This article is no exception, sharing the unique challenges encountered when restarting processes that handle UDP streams.

Marek Majkowski

Do not deploy on Friday!

This article examines the standard friday deploy prohibition and ultimately pushes back.

Ok… but why not?

Adrien Guéret — OpenClassrooms

Google SREs are changing the game again: a breakdown of their new approach

This article introduces the STAMP (System-Theoretic Accident Model and Processes) framework being adopted at Google, after first explaining the shortcomings in traditional SRE practices that prompted Google to adopt STAMP.

Jorge Lainfiesta — Rootly

Labeling a root cause is predicting the future, poorly

I really love this framing of what’s wrong with picking a single root cause.

Lorin Hochstein

SRE Weekly Issue #477

lex

May 18, 2025

General

Comments

View on sreweekly.com

Human Error Strikes Again… or Does It?

Why don’t we look for the root cause of a successful outcome?

Hamed Silatani — Uptime Labs

How we optimized LLM use for cost, quality, and safety to facilitate writing postmortems

They took a great deal of care to avoid the potential pitfalls of using an LLM in this way, and they share a lot of detail about the steps they took.

Tran Le, Till Pieper, and Gillian McGarvey — Datadog

When incident heroics are too heroic: the “bigger problems” limit

After dealing with a late-night outage with surprisingly small impact, I got thinking about how you would know if you were working too hard to guarantee uptime.

Tom Elliott

The 4 R’s of Pipeline Reliability: Data Systems That Last

In this article, learn how the 4 R’s — robust architecture, resumability, recoverability, and redundancy — enhance reliability in AI and ML data pipelines.

Sidhant bendre — DZone

Upgrading ECK Operator: A Side-by-Side Kubernetes Operator Upgrade Approach

In this article, I’ll delve into the challenges we encountered and the strategies we employed to manage operator upgrades for stateful workloads like Elasticsearch. Additionally, I’ll detail how we modified the ECK [Elastic Cloud on Kubernetes] operator to facilitate a more resilient side-by-side upgrade process.

Abhishek Munagekar — Mercari

Observability 2025: Navigating Costs, Complexity, and The Rise of AI

In this piece, I’ll delve into four macro challenges facing observability today, explore strategies that are emerging across the industry to address them, and offer my perspective on the trajectory of this crucial domain in the year to come.

Andrew Mallaband

Zero-Touch Bare Metal at Scale

A deep-dive into a pretty nifty system for enumerating and provisioning a rack of servers, complete with PXE-based Debian headless installation using an auto-generated preseed file. It also uses Claude to figure out what state a server is in from a screenshot obtained from the BMC.

Charith Amarasinghe — Railway

What is Koreo?

Koreo is a new open source tool for orchestrating Kubernetes infrastructure at a higher level than standard tools like Helm.

Koreo is a fairly complex tool, so it can be difficult to quickly grasp just what exactly it is, what problems it’s designed to solve, and how it compares to other, similar tools. In this post, I want to dive into these topics and also discuss the original motivation behind Koreo.

Tyler Treat

On work processes and outcomes

This one is about understanding how work actually happens in our sociotechnical systems (versus how we imagine it). This has implications for how we learn from incidents and how we design corrective actions.

Lorin Hochstein

SRE Weekly Issue #476

lex

May 11, 2025

General

Comments

View on sreweekly.com

Automation and The Substitution Myth

The myth is:

The underlying and often unexamined assumption for the benefits of automation is the notion that computers/machines are better at some tasks, and humans are better at a different, non-overlapping set of tasks.

This article lays out several pitfalls to this approach, with references.

Courtney Nash

Incident Report: Spotify Outage on April 16, 2025

Wow, I seriously love this one. It’s written in an a very approachable style that’s easy to understand from the outside. It lays a series of cringe-worthy contributing factors that could happen to any of us, making them a great learning opportunity.

Spotify

Navigating Incidents with Clarity Through Grounding

This is the first time I’ve come across the term “grounding” in incident response, and I like it!

At the core of our vision lies the principle of grounding, drawn from safety-critical systems like aviation and the fire service industries. Grounding is the process of maintaining a shared understanding among team members throughout the course of an incident.

Uptime Labs

How we use formal modeling, lightweight simulations, and chaos testing to design reliable distributed systems

I really like the idea of using formal modeling on distributed systems. Datadog explains how they did it when building a new message queuing service.

Arun Parthiban, Sesh Nalla, and Cecilia Wat-Kim

The EU AI Act and what it means for managing incidents

I found this to be a really useful primer on the new EU AI regulation. It does transition into a sales pitch toward the end, but the pre-pitch content is substantial.

Chris Evans — incident.io

Incident Report: April 30th, 2025

A classic example of Lorin’s Law: work intended to improve reliability was at the heart of this incident.

Railway

Why I’m not using feature flags…yet

Feature flags are incredibly useful, but they have some gotchas too.

Tom Elliott

Don’t make these feature flag mistakes

More potential problems to watch out for with feature flags, but this one ends by emphasizing that feature flags are still an important tool. Bonus points for a Knight Capital Incident mention.

Ian Vanagas

SRE Weekly Issue #475

lex

May 4, 2025

General

Comments

View on sreweekly.com

Anomaly Detection in Time Series Using Statistical Analysis

I haven’t seen this level of detail in an article on anomaly detection in quite awhile. Still, the math is very approachable even if you slept through stats class.

Ivan Shubin — Booking.com

A Key Incident Response Skill That Can Reduce Resolution Time

TL;DR: The Power of Knowledge Overlap in Incident Response

There’s an anecdote in this one that’s really making me think.

Hamed Silatani — Uptime Labs

Good models protect us from bad models

One of the criticisms leveled at resilience engineering is that the insights that the field generates aren’t actionable […]

This article argues that we still need the unactionable but good models, otherwise we’ll get actionable but wrong models.

Lorin Hochstein

Achieving relentless Kafka reliability at scale with the Streaming Platform

Datadog has put a lot of thought and effort into managing their massive Kafka workload. My favorite part of this article was the bit about accidentally zip-bombing themselves with highly compressible data.

Guillaume Bort — Datadog

Failover Routing for Disaster Recovery – Ensuring Your Customers Get to The Good Place

This one covers four techniques for rerouting customer traffic after a region failure using AWS’s Route 53… themed after the TV show The Good Place. It’s been quite awhile since I watched the show, but I still found the article pretty useful.

Seth Elliot — Arpio

Incident SEV scales are a waste of time

This article asks what we’re really looking to get by defining an incident severity scale, and proposes an alternative scale based on incident complexity.

Dan Slimmon

The Lost Fourth Pillar of Observability – Config Data Monitoring

I love this idea of tracking configuration changes as observability data. I’ve been through plenty of incidents in which I wish I had it.

Yevgeny Pats — CloudQuery

Building the future of resilient tech: Lessons from two decades in SRE

A short and sweet article packed with some useful nuggets. My favorite is the section near the end on timeouts.

Hemant Burman — Insights

SRE Weekly Issue #474

lex

April 27, 2025

General

Comments

View on sreweekly.com

Why do we do blameless incident reviews?

This is a truly outstanding article about blameless incident analysis! Beyond just “why”, it covers many of the pitfalls that trip people up when they try to enact a blameless culture, including questions about accountability.

fgj

Tech without us: Why there wasn’t an outage today

Here’s a good reminder that resilience in our systems is all about the humans.

Stuart Rimell

Taking out the Trash: Garbage Collection of Object Storage at Massive Scale

This article outlines WarpStream’s solution to a common problem in systems based on shared storage (like S3): cleaning up objects that are no longer needed, at scale.

Richard Artoul — WarpStream

How we structure on call rotations at Datadog

I love learning how companies structure their on-call rota. My favorite part of this one is the emphasis on keeping the manager in the rota as a feedback mechanism.

Laura de Vesine and David Lentz — Datadog

Terraform Drift Detection: How to Catch Configuration Drift

These folks continuously detect drift by running terraform plan and alerting on changes that have no corresponding commit in git.

Yugandhar Suthari

On Describing Not Explaining

It’s a troubleshooting story having nothing to do with tech, but the technique used can easily apply to your next incident.

Paige Cruz

The Dark Side of Terraform: Drifts, Chaos, and the Headaches They Bring

Some examples you may not have thought of that can lead to Terraform drift, along with an exploration of the problems drift can bring.

Saijal Shrivastava — Razorpay

Incident Report: April 23rd, 2025

Railway had an outage this week related to their control plane database, and they shared this write-up.

Ray Chen — Railway

SRE Weekly Issue #478

SRE Weekly Issue #477

SRE Weekly Issue #476

SRE Weekly Issue #475

SRE Weekly Issue #474

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, incident.io:

Subscribe

RSS

Mastodon

Search Issues