SRE Weekly Issue #504

Salt is Cloudflare’s configuration management tool.

How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?

The result of this has been a reduction in the duration of software release delays, and an overall reduction in toilsome, repetitive triage for SRE.

  Opeyemi Onikute, Menno Bezema, and Nick Rhodes — Cloudflare

In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the number of transient deployment failures at Netflix from 4% to 0.0001%.

  Jacob Meyers and Rob Zienert — Netflix

DrP provides an SDK that teams can use to define “analyzers” to perform investigations, plus post-processors to perform mitigations, notifications, and more.

  Shubham Somani, Vanish Talwar, Madhura Parikh, Chinmay Gandhi — Meta

This article goes into detail on the ways QA folks can reskill and map their responsibilities and skills to SRE practices.

   Nidhi Sharma — DZone

“Correction of Error” is the name used by Amazon for their incident review process, and there’s a lot to unpack there.

  Lorin Hochstein

In 2019, Charity Majors came down hard on deploy freezes with an article, Friday Deploy Freezes are Exactly Like Murdering Puppies.

This one takes a more moderate approach: maybe a deploy freeze is the right choice for your organization, but you should work to understand why rather than assuming.

  Charity Majors

A piece defining the term “resilience”, with an especially interesting discussion of the inherent trade-off between efficiency and resiliency.

  Uwe Friedrichsen

Honeycomb experienced a major, extended incident in December, and they published this (extensive!) interim report. Resolution required multiple days’ worth of engineering on new functionality and procedures related to Kafka. A theme of managing employees’ energy and resources is threaded throughout the report.

  Honeycomb

SRE Weekly Issue #503

Abstraction is meant to encapsulate complexity, but when done poorly, it creates opacity—a lack of visibility into what’s actually happening under the hood.

  RoseSecurity

This article uses publicly available incident data and an open source tool to show that MTTR is not under statistical control, making it a useless metric.

  Lorin Hochstein
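
For context, a common way to check whether a metric is under statistical control is an XmR (individuals and moving range) chart. Here’s a minimal sketch in Python (my own illustration with made-up numbers, not the open source tool or data the article uses):

```python
def xmr_limits(values):
    """Natural process limits for an XmR (individuals) chart.

    Points outside these limits suggest the process is not under
    statistical control. 2.66 is the standard XmR scaling constant
    (3 / d2, with d2 = 1.128 for moving ranges of size 2).
    """
    mean = sum(values) / len(values)
    moving_ranges = [abs(a - b) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    return mean - 2.66 * avg_mr, mean + 2.66 * avg_mr

# Hypothetical incident durations in minutes; the article uses real,
# publicly available incident data instead.
mttr_samples = [42, 55, 38, 47, 60, 41, 52, 480, 45, 58]
low, high = xmr_limits(mttr_samples)
print([v for v in mttr_samples if v < low or v > high])  # -> [480]
```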

Why should we trust an AI SRE Agent? This article describes a kind of agent that shows its sources and provides more detail when asked.

Presumably these folks are saying their agent meets this description, but this isn’t (directly) a marketing piece (except for the last 2 sentences).

  RunLLM

The idea here is targeted load shedding, terminating tasks that are the likely cause of overload, using efficient heuristics.

  Murat Demirbas — summary

  Yigong Hu, Zeyin Zhang, Yicheng Liu, Yile Gu, Shuangyu Lei, and Baris Kasikci — original paper
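
The paper’s heuristics are more sophisticated than this, but to illustrate the basic shape of targeted shedding (as opposed to dropping work indiscriminately), here’s a toy Python sketch that cancels the highest-cost in-flight tasks first:

```python
def shed_targeted(tasks, load, capacity):
    """Pick victims for load shedding, most expensive tasks first.

    tasks: list of (task_id, estimated_cost) pairs for in-flight work.
    Returns the task IDs to terminate so that load fits within capacity.
    """
    victims = []
    for task_id, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        if load <= capacity:
            break
        victims.append(task_id)
        load -= cost
    return victims

# Hypothetical in-flight tasks with rough per-task cost estimates.
print(shed_targeted([("a", 5), ("b", 40), ("c", 8), ("d", 30)], load=90, capacity=50))
# -> ['b']: terminating the likely cause of overload spares the small tasks
```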

Part 2 is just as good as the first, and I highly recommend reading it — along with the original Ironies of Automation paper.

  Uwe Friedrichsen

Take a deep technical dive into GitLab.com’s deployment pipeline, including progressive rollouts, canary strategies, database migrations, and multiversion compatibility.

  John Skarbek — GitLab

A fun debugging story with an unexpected resolution, plus a discussion of broader lessons learned.

  Liam Mackie — Octopus Deploy

A review of AWS’s talk on their incident, with info about what new detail AWS shared and some key insights from the author.

  Lorin Hochstein

Cloudflare discusses what they’re doing in response to their recent high-profile outages. They’re moving toward applying more structure and rigor to configuration deployments, like they already have for code deployments.

  Dane Knecht — Cloudflare

SRE Weekly Issue #502

Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.

  Harris Hancock — Cloudflare
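
Not Cloudflare’s actual implementation, but as a refresher on the core technique, here’s a minimal consistent-hashing sketch in Python: the same script name keeps landing on the same server even as servers come and go, which is what keeps requests hitting warm instances.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each server gets many points on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key):
        # First point clockwise from the key's hash; adding or removing a
        # server only remaps a small fraction of keys.
        idx = bisect_right(self.points, self._hash(key)) % len(self.points)
        return self.ring[idx][1]

# Hypothetical usage: route a Worker script to a stable server.
ring = HashRing([f"server-{n}" for n in range(20)])
print(ring.lookup("worker-script-abc"))
```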

I appreciate the way this article also shares how each of logs, metrics, traces, and alerts has its downsides, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.

  Mahmoud Abdelwahab — Railway

I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.

  John Allspaw — Adaptive Capacity Labs

IaC may bring more trouble than it solves, and it may simply move or hide complexity, according to this article.

  RoseSecurity

[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.

  Fred Hebert — summary

  Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper

This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.

  Marc Brooker

A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.

  Ar Hakboian — OpsWorker.ai

I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.

  Lorin Hochstein

SRE Weekly Issue #501

A message from our sponsor, Depot:

“Waiting for a runner” but the runner is online? Depot debugs three cases where symptoms misled engineers. Workflow permissions, Azure authentication, and Dependabot’s security context all caused failures that looked like infrastructure problems.

A thoughtful evaluation of current trends in AI through the lens of Lisanne Bainbridge’s classic paper, The Ironies of Automation. I really got a lot out of this one.

  Uwe Friedrichsen

They supercharged the workflow engine by rewriting it. I like the way they explained why they settled on a full rewrite and the alternative options they considered.

  Jun He, Yingyi Zhang, and Ely Spears — Netflix

This one goes deep on how to build a reliable service on unreliable parts. Can retries improve your overall reliability? What about the reliability of the retry system itself?

  Warren Parad — Authress
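
As a baseline for the retry discussion, here’s a sketch (mine, not from the article) of capped exponential backoff with full jitter, the usual starting point before you get into the reliability of the retry system itself:

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter.

    Jitter spreads retries out so a dependency recovering from an outage
    isn't flattened by a synchronized retry storm.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))

# Hypothetical usage, wrapping some unreliable dependency call:
# result = call_with_retries(lambda: flaky_api_client.get("/health"))
```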

In this article, we’ll explore how cold-restart dependencies form, why typical recovery designs break down, and what architectural principles can help systems warm up faster after a complete outage.

  Bala Kambala

This one goes into the qualities of a good post-incident review, the definition of resilience, and a discussion of blamelessness, drawing lessons from aviation.

  Gamunu Balagalla — Uptime Labs

It would be easy to blame the poor outcome of BOAC 712’s engine failure on human error, since the pilots missed key steps in their checklists. Instead, the investigators cited systemic issues, resulting in improvements in checklists and other areas.

  Mentour Pilot

Cloudflare had another significant outage, though not as big as the one last month. This one was related to steps they took to mitigate the big React RCE vulnerability.

  Dane Knecht — Cloudflare

Lorin’s whole analysis is awesome, but there’s an especially incisive section at the end that uses math to put Cloudflare’s run of 2 recent big incidents in perspective.

  Lorin Hochstein

SRE Weekly Issue #500

A message from our sponsor, Depot:

Stop hunting through GitHub Actions logs. Depot now offers powerful CI log search across all your repositories and workflows. With smart filtering by timeframe, runner type, and keywords, you’ll have all the information at your fingertips to debug faster.

Wow, five hundred issues! I sent the first issue of SRE Weekly out almost exactly ten years ago. I assumed my little experiment would fairly quickly come to an end when I exhausted the supply of SRE-related articles.

I needn’t have worried. Somehow, the authors I’ve featured here have continued to produce a seemingly endless stream of excellent articles. If anything, the pace has only picked up over time! A profound thank you to all of the authors, without whom this newsletter would be just an empty bulleted list.

And thanks to you, dear readers, for making this worthwhile. Thanks for sharing the articles you find or write, I love receiving them! Thanks for the notes you send after an issue you particularly like, and the corrections too. Thanks for your kind well-wishes for my recent surgery, they meant a ton.

Finally, thanks to my sponsors, whose support makes all this possible. If you see something interesting, please give it a click and check it out!

When a scale-up event actually causes increased resource usage for a while, a standard auto-scaling algorithm can fail.

   Minh Nhat Nguyen, Shi Kai Ng, and Calvin Tran — Grab
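
To make the failure mode concrete: if scaling up itself temporarily raises resource usage (cache warm-up, rebalancing), a naive controller reads that as “still overloaded” and keeps scaling. A common guard, sketched below in Python (hypothetical, not Grab’s algorithm), is to suppress further scale-ups during a warm-up window:

```python
import time

def decide_replicas(cpu_utilization, target, replicas,
                    last_scale_up_ts, warmup_seconds=300, now=None):
    """Scale-up decision with a warm-up guard.

    Reacting to utilization while a previous scale-up is still warming up
    (and temporarily inflating resource usage) causes runaway scaling, so
    further scale-ups are suppressed until the warm-up window has passed.
    """
    now = time.time() if now is None else now
    warming_up = now - last_scale_up_ts < warmup_seconds
    if cpu_utilization > target and not warming_up:
        return replicas + 1
    return replicas

# 80% CPU two minutes after the last scale-up: still in the warm-up
# window, so hold at 4 replicas instead of piling on.
print(decide_replicas(0.8, target=0.6, replicas=4, last_scale_up_ts=time.time() - 120))
```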

A database schema change added an index on a large table without using the CONCURRENTLY option, locking the table. This reminds me of a similar incident, and its solution, from when I worked at Honeycomb.

  Ray Chen — Railway
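
For reference, the safer variant looks roughly like this (a sketch with a hypothetical table name, using psycopg2, not Railway’s actual migration). CREATE INDEX CONCURRENTLY builds the index without blocking writes to the table, though it can’t run inside a transaction block:

```python
import psycopg2

# Hypothetical connection string; adjust for your environment.
conn = psycopg2.connect("dbname=app user=app")

# CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
# so autocommit has to be enabled first.
conn.autocommit = True

with conn.cursor() as cur:
    # Builds the index without taking the lock that blocks writes for
    # the duration of the build (at the cost of a slower build).
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_created_at "
        "ON events (created_at)"
    )

conn.close()
```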

Oof, that’s a harsh title, but this is a great discussion of how we strive to design for reliability even when our downstream vendors have outages.

  Uwe Friedrichsen

This one has a lot of good recommendations for staff-level SREs covering 8 areas, shared by a former Staff SRE.

  Karan Nagarajagowda

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses prolonged by synchronous log writes to a busy disk.

   Nataraj Mocherla — DZone

This air accident report video by Mentour Pilot has a great example of alert fatigue around 30 minutes in. The air traffic controllers received enough spurious conflict alerts every day that they became easy to ignore.

  Mentour Pilot

In this post you learn:
* What are emergent properties and what kind of system has them?
* What are weak and strong emergence as opposed to resultant properties?
* How do emergent properties impact the reliability, maintainability, predictability, and cost of the system?

Well worth a read. It really got me thinking about emergence and its relationship to reliability.

  Alex Ewerlöf

In an incident, it’s important to have someone be in charge — and for it to be clear who that is, as explained in this article.

  Joe Mckevitt — Uptime Labs

A production of Tinker Tinker Tinker, LLC