SRE WEEKLY – Page 4 – scalability, availability, incident response, automation

SRE Weekly Issue #506

lex

January 18, 2026

What came first: the CNAME or the A record?

I didn’t know that some resolvers care about the order of some DNS records in a response, but I’m not surprised. The DNS spec, despite its age and multiple revisions, has a number of ambiguities like this.

Sebastiaan Neuteboom — Cloudflare
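
For readers wondering how record order could matter at all, here is a minimal sketch. It is a hypothetical illustration, not taken from the article or from any specific resolver: the function names, record tuples, and example hostnames are all assumptions. A parser that walks the answer section once, top to bottom, only finds the address if the CNAME comes before the A record it points to, while a parser that indexes the records first does not care about order.

```python
# Hypothetical answer records as (name, type, value) tuples; not real DNS data.

def resolve_single_pass(qname, answers):
    """Order-sensitive: walk the answer section once, top to bottom."""
    current = qname
    addresses = []
    for name, rtype, value in answers:
        if name != current:
            continue  # record belongs to a name we are not (yet) following
        if rtype == "CNAME":
            current = value  # follow the alias for later records
        elif rtype == "A":
            addresses.append(value)
    return addresses


def resolve_any_order(qname, answers):
    """Order-independent: index the CNAMEs first, then follow the chain."""
    cnames = {name: value for name, rtype, value in answers if rtype == "CNAME"}
    current = qname
    seen = set()
    while current in cnames and current not in seen:  # guard against loops
        seen.add(current)
        current = cnames[current]
    return [v for name, rtype, v in answers if rtype == "A" and name == current]


cname_first = [
    ("www.example.com", "CNAME", "edge.example.net"),
    ("edge.example.net", "A", "192.0.2.10"),
]
a_first = list(reversed(cname_first))
```

With `a_first`, the single-pass version skips the A record because it has not seen the alias yet, so it returns an empty list; the indexed version returns the address for both orderings.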

Another way to rate incidents

Severity isn’t always the best indicator of the incidents we can learn the most from. What if we rate our incidents on their potential for learning?

Lorin Hochstein

The Biggest Time Sinks During Outages

This one discusses three ways you can lose time in incidents and ideas for what you can do about it.

Hrishikesh Barua — Uptime Labs

We default to addition

An interesting discussion of a bias: we tend to solve problems by adding things to our systems, and that increases complexity. AI can amplify this bias.

Uwe Friedrichsen

BTS of OpenTelemetry Auto-instrumentation

Ever wondered how OTel auto-instrumentation works? This article explains it in detail (with code examples) for Python, Java, and Go.

Elizabeth — Observability Real Talk

How we built an AI SRE agent that investigates like a team of engineers

This article stands out from others about AI SRE agents because it goes into some detail on their method for evaluating whether their agent works. I’d love to see more of the actual evaluation results, and examples of it getting things right vs wrong.

Daniel Shan and Tristan Ratchford — Datadog

When protections outlive their purpose: A lesson on managing defense systems at scale

I recently got an error from GitHub saying I’d exceeded a rate limit (when I definitely didn’t), and this article explains why.

See why observability and lifecycle management are critical for defense systems.

Thomas Kjær Aabo — GitHub

More telemetry makes reliability worse (until you fix the loop)

Poor telemetry makes us want to add more telemetry, which can decrease our telemetry quality and make us add more, yikes! How can we fix the feedback loop?

Note for blind or low-vision readers: there’s a pretty important diagram in this one without a caption or alt text.

Ash Patel

SRE Weekly Issue #505

lex

January 11, 2026

2013–09–17 Outage Postmortem

An incident write-up from the archives, and it’s a juicy one. An update to their code caused a crash only after some time had passed, so their automated testing didn’t catch it before they deployed it worldwide.

Xandr

Quick takes on the Triple Zero Outage at Optus – the Schott Review

This article covers an independent review of the Optus outage.

I personally find it astounding that somebody conducting an incident investigation would not delve deeper into how a decision that appears to be astounding would have made sense in the moment.

Lorin Hochstein

How Workers powers our internal maintenance scheduling pipeline

Cloudflare needed a tool to look for overlapping impact across their many maintenance events in order to avoid unintentionally impairing redundancy.

Kevin Deems and Michael Hoffmann — Cloudflare
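
The core check described here, spotting maintenance events whose impact overlaps, can be sketched in a few lines. This is a hedged illustration only, not Cloudflare's pipeline: the `Maintenance` type, the notion of a named "redundancy group", and the integer timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Maintenance:
    name: str
    start: int          # start time, e.g. minutes since some reference point
    end: int            # end time (exclusive), same unit as start
    groups: frozenset   # hypothetical redundancy groups this event touches


def conflicts(events):
    """Return pairs of events that overlap in time and share a redundancy group."""
    found = []
    for a, b in combinations(events, 2):
        overlap_in_time = a.start < b.end and b.start < a.end
        shared = a.groups & b.groups
        if overlap_in_time and shared:
            found.append((a.name, b.name, sorted(shared)))
    return found


events = [
    Maintenance("router-a", 0, 60, frozenset({"edge-pair-1"})),
    Maintenance("router-b", 30, 90, frozenset({"edge-pair-1"})),
    Maintenance("router-c", 60, 120, frozenset({"edge-pair-1"})),
]
```

Here `router-a` and `router-b` conflict, as do `router-b` and `router-c`, while `router-a` and `router-c` do not because one ends exactly when the other starts. The pairwise comparison is quadratic, which is fine for a sketch but would likely need an interval index at real scale.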

Expiry times are dangerous, on “The dangers of SSL certificates”

Another great piece on expiration dates. I especially like the discussion of abrupt cliffs as a design choice.

Chris Siebenmann — University of Toronto

SRE Is Anti-Transactional: An API for interfacing with automaters

It’s not always easy to see how to automate a given bit of toil, especially when cross-team interactions are involved.

Thomas A. Limoncelli and Christian Pearce — ACM Queue

Resilience vs. Fault tolerance

How do resilience and fault tolerance relate? Are they synonyms, do they overlap, or does one contain the other?

Uwe Friedrichsen

Datadog, Thank You for Blocking Us: Why Vendor Lock-In No Longer Matters

After unexpectedly losing their observability vendor, these folks were able to migrate to a new solution within a couple of days.

Karan Abrol, Yating Zhou, Pratyush Verma, Aditya Bhandari, and Sameer Agarwal — Deductive.ai

You Can’t Debug a System by Blaming a Person

A great dive into what blameless incident analysis really means.

Blameless also doesn’t mean you stop talking about what people did.

Busra Koken

SRE Weekly Issue #504

lex

January 4, 2026

Finding the grain of sand in a heap of Salt

Salt is Cloudflare’s configuration management tool.

How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?

The result of this has been a reduction in the duration of software release delays, and an overall reduction in toilsome, repetitive triage for SRE.

Opeyemi Onikute, Menno Bezema, Nick Rhodes — Cloudflare

How Temporal Powers Reliable Cloud Operations at Netflix

In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the number of transient deployment failures at Netflix from 4% to 0.0001%.

Jacob Meyers and Rob Zienert — Netflix

DrP: Meta’s Root Cause Analysis Platform at Scale

DrP provides an SDK that teams can use to define “analyzers” to perform investigations, plus post-processors to perform mitigations, notifications, and more.

Shubham Somani, Vanish Talwar, Madhura Parikh, Chinmay Gandhi — Meta

Rethinking QA: From DevOps to Platform Engineering and SRE

This article goes into detail on the ways QA folks can reskill and map their responsibilities and skills to SRE practices.

Nidhi Sharma — DZone

Why I don’t like “Correction of Error”

“Correction of Error” is the name used by Amazon for their incident review process, and there’s a lot to unpack there.

Lorin Hochstein

On Friday Deploys: Sometimes that Puppy Needs Murdering

In 2019, Charity Majors came down hard on deploy freezes with an article, Friday Deploy Freezes are Exactly Like Murdering Puppies.

This one takes a more moderate approach: maybe a deploy freeze is the right choice for your organization, but you should work to understand why rather than assuming.

Charity Majors

Resilience

A piece defining the term “resilience”, with an especially interesting discussion of the inherent trade-off between efficiency and resiliency.

Uwe Friedrichsen

Querying and Ingest issues in EU

Honeycomb experienced a major, extended incident in December, and they published this (extensive!) interim report. Resolution required multiple days’ worth of engineering on new functionality and procedures related to Kafka. A theme of managing employees’ energy and resources is threaded throughout the report.

Honeycomb

SRE Weekly Issue #503

lex

December 28, 2025

The Abstraction Debt in Infrastructure as Code

Abstraction is meant to encapsulate complexity, but when done poorly, it creates opacity—a lack of visibility into what’s actually happening under the hood.

RoseSecurity

Fun with incident data and statistical process control

This article uses publicly available incident data and an open source tool to show that MTTR is not under statistical control, making it a useless metric.

Lorin Hochstein
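
As a rough idea of what "under statistical control" means, here is a minimal sketch of an XmR (individuals and moving range) chart, one common statistical process control tool. The article's own tool and data may differ, and the durations below are made up for illustration. The limits are the mean plus or minus 2.66 times the average moving range, and points outside them are signals that the process is not predictable.

```python
def xmr_limits(values):
    """Return (lower, center, upper) natural process limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    center = sum(values) / len(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is the standard XmR scaling constant (3 / 1.128).
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar


def out_of_control(values):
    """Return the observations that fall outside the process limits."""
    lower, _, upper = xmr_limits(values)
    return [v for v in values if v < lower or v > upper]


# Made-up incident durations in minutes, for illustration only.
durations = [30, 35, 32, 31, 34, 33, 300, 29]
```

In this made-up series the 300-minute incident falls above the upper limit. A series with points like that has no stable mean to track, which is the sense in which an average such as MTTR stops being informative.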

The Glass Box AI SRE

Why should we trust an AI SRE Agent? This article describes a kind of agent that shows its sources and provides more detail when asked.

Presumably these folks are saying their agent meets this description, but this isn’t (directly) a marketing piece (except for the last 2 sentences).

RunLLM

Mitigating Application Resource Overload with Targeted Task Cancellation

The idea here is targeted load shedding, terminating tasks that are the likely cause of overload, using efficient heuristics.

Murat Demirbas — summary

Yigong Hu, Zeyin Zhang, Yicheng Liu, Yile Gu, Shuangyu Lei, and Baris Kasikci — original paper

AI and the ironies of automation – Part 2

Part 2 is just as good as the first, and I highly recommend reading it — along with the original Ironies of Automation paper.

Uwe Friedrichsen

Deploying the world’s largest GitLab instance 12 times daily

Take a deep technical dive into GitLab.com’s deployment pipeline, including progressive rollouts, Canary strategies, database migrations, and multiversion compatibility.

John Skarbek — GitLab

It works on my cluster: a tale of two troubleshooters

A fun debugging story with an unexpected resolution, plus a discussion of broader lessons learned.

Liam Mackie — Octopus Deploy

AWS re:Invent talk on their Oct ’25 incident

A review of AWS’s talk on their incident, with info about what new detail AWS shared and some key insights from the author.

Lorin Hochstein

Code Orange: Fail Small — our resilience plan following recent incidents

Cloudflare discusses what they’re doing in response to their recent high-profile outages. They’re moving toward applying more structure and rigor to configuration deployments, like they already have for code deployments.

Dane Knecht — Cloudflare

SRE Weekly Issue #502

lex

December 21, 2025

Eliminating Cold Starts 2: shard and conquer

Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.

Harris Hancock — Cloudflare
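
For readers new to consistent hashing, here is a minimal sketch of a hash ring with virtual nodes. It is a generic textbook version, not Cloudflare's implementation: the `HashRing` class, the `vnodes` parameter, and the key names are all assumptions for illustration. Each key maps to the first ring point at or after its hash, so removing a server only moves the keys that server owned.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("need at least one node")
        # Place several points per node on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


full = HashRing(["server-a", "server-b", "server-c"])
smaller = HashRing(["server-a", "server-b"])
keys = [f"script-{i}" for i in range(200)]
```

Comparing `full` and `smaller` shows the property that matters for cold starts: every key that was not on `server-c` still lands on the same server after `server-c` is removed, so whatever that server had already loaded stays useful.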

Monitoring & Observability: Using Logs, Metrics, Traces, and Alerts to Understand System Failures

I appreciate the way this article also shares how each of logs, metrics, traces, and alerts has its downsides, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.

Mahmoud Abdelwahab — Railway

Uptime Labs: Building Expertise in Incident Response

I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.

John Allspaw — Adaptive Capacity Labs

KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever

IaC may create more problems than it solves, and it may simply move or hide complexity, according to this article.

RoseSecurity

Paper: The Failure Gap

[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.

Fred Hebert — summary

Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper

What Now? Handling Errors in Large Systems

This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.

Marc Brooker

Agent-Driven SRE Investigations: A Practical Deep Dive into Multi-Agent Incident Response

A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.

Ar Hakboian — OpsWorker.ai

Brief thoughts on the recent Cloudflare outage

I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.

Lorin Hochstein

Subscribe

RSS

Mastodon

Search Issues
