SRE WEEKLY – Page 3 – scalability, availability, incident response, automation

SRE Weekly Issue #475

lex

May 4, 2025

Anomaly Detection in Time Series Using Statistical Analysis

I haven’t seen this level of detail in an article on anomaly detection in quite a while. Still, the math is very approachable even if you slept through stats class.

Ivan Shubin — Booking.com

A Key Incident Response Skill That Can Reduce Resolution Time

TL;DR: The Power of Knowledge Overlap in Incident Response

There’s an anecdote in this one that’s really making me think.

Hamed Silatani — Uptime Labs

Good models protect us from bad models

One of the criticisms leveled at resilience engineering is that the insights that the field generates aren’t actionable […]

This article argues that we still need the unactionable but good models, otherwise we’ll get actionable but wrong models.

Lorin Hochstein

Achieving relentless Kafka reliability at scale with the Streaming Platform

Datadog has put a lot of thought and effort into managing their massive Kafka workload. My favorite part of this article was the bit about accidentally zip-bombing themselves with highly compressible data.

Guillaume Bort — Datadog

Failover Routing for Disaster Recovery – Ensuring Your Customers Get to The Good Place

This one covers four techniques for rerouting customer traffic after a region failure using AWS’s Route 53… themed after the TV show The Good Place. It’s been quite a while since I watched the show, but I still found the article pretty useful.

Seth Elliot — Arpio

Incident SEV scales are a waste of time

This article asks what we’re really looking to get by defining an incident severity scale, and proposes an alternative scale based on incident complexity.

Dan Slimmon

The Lost Fourth Pillar of Observability – Config Data Monitoring

I love this idea of tracking configuration changes as observability data. I’ve been through plenty of incidents in which I wished I had it.

Yevgeny Pats — CloudQuery

Building the future of resilient tech: Lessons from two decades in SRE

A short and sweet article packed with some useful nuggets. My favorite is the section near the end on timeouts.

Hemant Burman — Insights

SRE Weekly Issue #474

lex

April 27, 2025

Why do we do blameless incident reviews?

This is a truly outstanding article about blameless incident analysis! Beyond just “why”, it covers many of the pitfalls that trip people up when they try to enact a blameless culture, including questions about accountability.

fgj

Tech without us: Why there wasn’t an outage today

Here’s a good reminder that resilience in our systems is all about the humans.

Stuart Rimell

Taking out the Trash: Garbage Collection of Object Storage at Massive Scale

This article outlines WarpStream’s solution to a common problem in systems based on shared storage (like S3): cleaning up objects that are no longer needed, at scale.

Richard Artoul — WarpStream

How we structure on call rotations at Datadog

I love learning how companies structure their on-call rota. My favorite part of this one is the emphasis on keeping the manager in the rota as a feedback mechanism.

Laura de Vesine and David Lentz — Datadog

Terraform Drift Detection: How to Catch Configuration Drift

These folks continuously detect drift by running terraform plan and alerting on changes that have no corresponding commit in git.

Yugandhar Suthari
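The drift check described in the item above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it relies on the `-detailed-exitcode` flag of `terraform plan` (exit code 0 for no changes, 1 for an error, 2 for pending changes), and the `has_pending_commit` flag, the `check_drift` helper, and the returned status strings are hypothetical names introduced here. How you decide whether a matching commit exists in git is left to the caller.

```python
import subprocess
from typing import Callable

# Exit codes returned by `terraform plan -detailed-exitcode`.
PLAN_NO_CHANGES = 0
PLAN_ERROR = 1
PLAN_HAS_CHANGES = 2


def run_plan(workdir: str) -> int:
    """Run `terraform plan` in workdir and return its exit code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode


def check_drift(
    workdir: str,
    has_pending_commit: bool,
    plan: Callable[[str], int] = run_plan,
) -> str:
    """Classify the state of a Terraform working directory.

    has_pending_commit is a hypothetical input: True when git holds a
    commit that explains the planned changes (for example, an unapplied merge).
    """
    code = plan(workdir)
    if code == PLAN_NO_CHANGES:
        return "in_sync"
    if code == PLAN_HAS_CHANGES:
        # Changes with no commit behind them are treated as drift.
        return "pending_change" if has_pending_commit else "drift"
    raise RuntimeError(f"terraform plan failed with exit code {code}")
```

A scheduled job could call `check_drift` for each workspace and send an alert whenever it returns `"drift"`. The `plan` parameter exists only so the logic can be exercised without a real Terraform installation.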

On Describing Not Explaining

It’s a troubleshooting story that has nothing to do with tech, but the technique used can easily apply to your next incident.

Paige Cruz

The Dark Side of Terraform: Drifts, Chaos, and the Headaches They Bring

This one covers some causes of Terraform drift that you may not have thought of, along with an exploration of the problems drift can bring.

Saijal Shrivastava — Razorpay

Incident Report: April 23rd, 2025

Railway had an outage this week related to their control plane database, and they shared this write-up.

Ray Chen — Railway

SRE Weekly Issue #473

lex

April 20, 2025

Scaling Nextdoor’s Datastores: Part 5

In this final installment of the Scaling Nextdoor’s Datastores blog series, we detail how the Core-Services team at Nextdoor solved cache consistency challenges as part of a holistic approach to improve our database and cache scalability and usability.

I really enjoyed this whole series. Thanks, Nextdoor folks!

Slava Markeyev — Nextdoor

Turning Non-Prod Incidents into Resilience-Building Opportunities

These folks analyzed a non-production incident like it was production, including retrospective analysis and lessons learned. Best part: they share the juicy details with us!

Joe Mckevitt — Uptime Labs

How Should You Compensate Your Employees for Being On Call?

This one goes over several different models you can use to implement on-call compensation, with pros and cons for each.

Constant Fischer — PagerDuty

Evaluating MySQL Lock Scheduling Performance: CATS vs FIFO

This article shows that MySQL’s CATS algorithm offers only a small performance gain over FIFO once deadlock logging interference is removed.

My jaw involuntarily opened when I saw the graph after they commented out the logging print statements.

Bin Wang — DZone

Chaos Engineering for Microservices

In this article, I’ll walk you through how we implemented chaos engineering across our stack using Chaos Toolkit, Chaos Monkey, and Istio — with hands-on examples for Java and Node.js. If you’re exploring ways to strengthen system resilience, this guide is packed with practical insights you can apply today.

The author does not appear to have a tie to Istio. This article has a ton of code snippets to help you get started.

Prabhu Chinnasamy — DZone

Three key facts about serverless reliability

In this blog, we’ll look at three important facts about serverless reliability that teams often overlook. We’ll explain what they are, what the risks are of not addressing them, and how you can make your serverless applications more fault-tolerant.

Serverless architectures don’t guarantee reliability.

You do have control over serverless reliability.

Serverless reliability practices can benefit all platforms, not just serverless platforms.

Andre Newman — Gremlin

A Trip Down Memory Lane: How We Resolved a Memory Leak When pprof Failed Us

This Golang debugging story is a really satisfying read.

The heap profiles were very effective at telling us the allocation sites of live objects, but provided no insights into why specific objects were being retained.

Ella Chao — WarpStream

Issue with multiple Zoom Services

Zoom had an outage this week when its domain zoom.us was temporarily blocked at the TLD level due to a miscommunication between its registrar and the TLD.

Zoom

SRE Weekly Issue #472

lex

April 13, 2025

Scaling Nextdoor’s Datastores: Part 4

In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache.

Ronak Shah — Nextdoor

Going beyond MTTx and measuring “good” incident management

Okay, if we’re not supposed to use MTTR, what metrics in incident response are better?

Chris Evans — incident.io

This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.

Things that go wrong with disk IO

This reminds me of the Fallacies of Distributed Computing, and it’s equally important to internalize. Disk I/O isn’t guaranteed.

Phil Eaton

Critical Step MISSED! | What Happened on Jet2 Flight 2152?!

Here’s a great example of how we can learn a ton from near misses. In this airplane incident, a slight change in the normal takeoff sequence resulted in missing a critical step. As a result of this near miss, the aviation industry still instituted changes to make this kind of problem less likely.

Mentour Pilot — YouTube

(Un)coupling in distributed systems – Part 2

In this second and final post of this little blog series, we will discuss the redundancy fallacy and the 3rd type of coupling, we need to consider in the context of remote communication, which is temporal coupling.

Uwe Friedrichsen

Model error

All of our systems have embedded models of the world. What happens when these models are wrong?

Lorin Hochstein

Three Guiding Lights on Sustaining Resilience

This article answers this question:

“If we had to choose just three things to sustain a resilient, healthy reliability culture, what would they be?”

with these three things:

Know what matters to your users, and make it really visible

Create Psychological Safety Around Failure

Let incidents update your mental models

Busra Koken

Hot Take: I Want Execs Closer to Incidents, Not Farther

Execs intruding in incidents can have a disruptive effect, which this article acknowledges with specific examples. It goes on to list some concrete and useful things execs can do to support incident response.

By the way, massive props to the Uptime Labs folks. They created an RSS feed for their blog at my request with a super-fast turnaround. Incredible!

Hamed Silatani — Uptime Labs

SRE Weekly Issue #471

lex

April 6, 2025

Models, models every where, so let’s have a think

The author of this one draws a line between their two interests of formal methods and resilience engineering, and I’m so here for it.

Lorin Hochstein

Scaling Nextdoor’s Datastores: Part 3

In this part of the Scaling Nextdoor’s Datastores blog series, we’ll explore how the Core-Services team at Nextdoor serializes database data for caching while ensuring forward and backward compatibility between the cache and application code.

Ronak Shah — Nextdoor

The State of Online Schema Migrations in MySQL

MySQL’s ALTER TABLE INPLACE has limitations and downsides, and INSTANT does too, as explained in this article.

Shlomi Noach — PlanetScale

One or Two? How Many Queues?

If you have multiple different types of work in your system, a queue per type of work may be a good choice.

Bonus(?): includes a bathroom-based analogy.

Marc Brooker
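To make the queue-per-type idea from the item above concrete, here is a small sketch. It is not taken from the article: the `PerTypeQueues` class and the round-robin order between work types are assumptions chosen for illustration, and a real system might weight or prioritize the queues differently.

```python
from collections import deque
from typing import Any, Deque, Dict, Optional, Tuple


class PerTypeQueues:
    """One FIFO queue per kind of work, served round-robin (an assumption)."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Any]] = {}
        self._order: Deque[str] = deque()  # rotation order of known kinds

    def put(self, kind: str, item: Any) -> None:
        """Enqueue item on the queue for its kind, creating it if needed."""
        if kind not in self._queues:
            self._queues[kind] = deque()
            self._order.append(kind)
        self._queues[kind].append(item)

    def get(self) -> Optional[Tuple[str, Any]]:
        """Return the next (kind, item), or None if every queue is empty."""
        for _ in range(len(self._order)):
            kind = self._order[0]
            self._order.rotate(-1)  # move this kind to the back of the rotation
            queue = self._queues[kind]
            if queue:
                return kind, queue.popleft()
        return None


queues = PerTypeQueues()
for name in ("report-1", "report-2", "report-3"):
    queues.put("report", name)
queues.put("ping", "ping-1")

served = []
while (entry := queues.get()) is not None:
    served.append(entry)
```

With a single shared FIFO queue, `ping-1` would wait behind all three reports. Here it is served second, because each kind of work gets its own turn.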

The pros and cons of Lambdalith

One Lambda function per URL path? Or a monolithic function that handles multiple paths? There are benefits and drawbacks to each.

Yan Cui

Introducing Agentic CTO: executive oversight in every incident

Published on April 1.

The truth is, many incidents move faster when there’s executive oversight — a sense of urgency, pressure, and someone repeatedly asking, “What’s the ETA?”

Chris Evans — incident.io

This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.

AI in Incident Management: Balancing Automation & Expertise

I’m seeing a lot of echoes of Bainbridge’s Ironies of Automation in this article about AIOps and AI tooling. If AI handles most coding and incidents, how will humans handle the outliers?

Hamed Silatani — Uptime Labs

Impressions of SRECon Americas 2025

I wasn’t able to make it, so I really appreciate this recap. Sounds like SRECon was, unsurprisingly, heavily focused on AI this time around.

Niall Murphy
