General

SRE Weekly Issue #502

Cloudflare reduced their cold-start rate for Workers requests through sharding and consistent hashing, with an interesting solution for load shedding.

  Harris Hancock — Cloudflare
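The article has far more detail than I can do justice to here, but if consistent hashing is new to you, here’s a bare-bones Python sketch of the placement idea (server and script names are mine, not Cloudflare’s): route each Worker script to a stable point on a hash ring so repeat requests land on a machine that likely already has a warm isolate.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash so placement survives process restarts.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps script names to servers; adding or removing a server only
    remaps the keys that fall in that server's arc of the ring."""

    def __init__(self, servers, vnodes: int = 100):
        self._ring = []  # sorted list of (point, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((_hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"metal-{n}" for n in range(8)])
print(ring.lookup("customer-worker-script"))  # same script -> same server,
                                              # so its isolate stays warm
```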

I appreciate the way this article also shares how logs, metrics, traces, and alerts each have their downsides, and what you can do instead. FYI, there’s also a fairly extensive product-specific second half about observability on Railway.

  Mahmoud Abdelwahab — Railway

I don’t often include direct product introductions like this explanation of Uptime Labs’s incident simulation platform from Adaptive Capacity Labs. I’m making an exception in this case because I feel that incident simulation has huge potential to improve reliability, and I see very few articles about it.

  John Allspaw — Adaptive Capacity Labs

IaC may create more problems than it solves, and it may simply move or hide complexity, according to this article.

  RoseSecurity

[…] the failure gap, which is the idea that people vastly underestimate the actual number and rate of failures that happen in the world compared to successes.

  Fred Hebert — summary

  Lauren Eskreis-Winkler, Kaitlin Woolley, Minhee Kim, and Eliana Polimeni — original paper

This one’s fun. You get to play along with the author, voting on an error handling strategy and then seeing what the author thinks and why.

  Marc Brooker

A chronicle of a sandboxed experiment in using multiple instances of Claude to investigate incidents. I like the level of detail and transparency in their experimental setup.

  Ar Hakboian — OpsWorker.ai

I have a bit of an article backlog, so note that this is about the November outage, not the more recent outage on December 5.

  Lorin Hochstein

SRE Weekly Issue #501

A message from our sponsor, Depot:

“Waiting for a runner” but the runner is online? Depot debugs three cases where symptoms misled engineers. Workflow permissions, Azure authentication, and Dependabot’s security context all caused failures that looked like infrastructure problems.

A thoughtful evaluation of current trends in AI through the lens of Lisanne Bainbridge’s classic paper, The Ironies of Automation. I really got a lot out of this one.

  Uwe Friedrichsen

They supercharged the workflow engine by rewriting it. I like the way they explained why they settled on a full rewrite and the alternative options they considered.

  Jun He, Yingyi Zhang, and Ely Spears — Netflix

This one goes deep on how to build a reliable service on unreliable parts. Can retries improve your overall reliability? What about the reliability of the retry system itself?

  Warren Parad — Authress
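The article goes much deeper than this, but the core arithmetic is worth internalizing: if one attempt succeeds with probability p, then n independent attempts succeed with probability 1 − (1 − p)^n. Here’s a generic sketch (not Authress’s code; delays and attempt counts are arbitrary) of a retry with backoff and jitter, so the retry layer doesn’t become its own reliability problem:

```python
import random
import time

def retry(call, attempts: int = 3, base_delay: float = 0.1):
    """Retry with capped exponential backoff and full jitter, so the
    retry layer itself doesn't hammer an already-struggling dependency."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retry budget: surface the failure
            # Full jitter: sleep a random amount up to the backoff cap.
            time.sleep(random.uniform(0, min(base_delay * 2 ** attempt, 2.0)))

# If a single attempt succeeds with probability p, n attempts succeed with
# probability 1 - (1 - p) ** n -- e.g. p = 0.99, n = 3 gives ~0.999999.
p, n = 0.99, 3
print(1 - (1 - p) ** n)
```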

In this article, we’ll explore how cold-restart dependencies form, why typical recovery designs break down, and what architectural principles can help systems warm up faster after a complete outage.

  Bala Kambala

This one goes into the qualities of a good post-incident review, the definition of resilience, and a discussion of blamelessness, drawing lessons from aviation.

  Gamunu Balagalla — Uptime Labs

It would be easy to blame the poor outcome of BOAC 712’s engine failure on human error since the pilots missed key steps in their checklists. Instead, the investigation cited systemic issues, resulting in improvements in checklists and other areas.

  Mentour Pilot

Cloudflare had another significant outage, though not as big as the one last month. This one was related to steps they took to mitigate the big React RCE vulnerability.

  Dane Knecht — Cloudflare

Lorin’s whole analysis is awesome, but there’s an especially incisive section at the end that uses math to put Cloudflare’s run of 2 recent big incidents in perspective.

  Lorin Hochstein

SRE Weekly Issue #500

A message from our sponsor, Depot:

Stop hunting through GitHub Actions logs. Depot now offers powerful CI log search across all your repositories and workflows. With smart filtering by timeframe, runner type, and keywords, you’ll have all the information at your fingertips to debug faster.

Wow, five hundred issues! I sent the first issue of SRE Weekly out almost exactly ten years ago. I assumed my little experiment would fairly quickly come to an end when I exhausted the supply of SRE-related articles.

I needn’t have worried. Somehow, the authors I’ve featured here have continued to produce a seemingly endless stream of excellent articles. If anything, the pace has only picked up over time! A profound thank you to all of the authors, without whom this newsletter would be just an empty bulleted list.

And thanks to you, dear readers, for making this worthwhile. Thanks for sharing the articles you find or write, I love receiving them! Thanks for the notes you send after an issue you particularly like, and the corrections too. Thanks for your kind well-wishes for my recent surgery, they meant a ton.

Finally, thanks to my sponsors, whose support makes all this possible. If you see something interesting, please give it a click and check it out!

When a scale-up event actually causes increased resource usage for a while, a standard auto-scaling algorithm can fail.

  Minh Nhat Nguyen, Shi Kai Ng, and Calvin Tran — Grab
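Their fix is worth reading in full; as a generic illustration of the failure mode (not necessarily Grab’s approach, and with arbitrary thresholds and timings), here’s a toy autoscaler with a cooldown after scaling up, since freshly added capacity can temporarily push aggregate resource usage higher rather than lower:

```python
import time

class Autoscaler:
    """Threshold autoscaler with a cooldown: after scaling up, ignore
    further scaling signals for a while, because freshly added replicas
    (cache warming, JIT, rebalancing) temporarily *raise* utilization."""

    def __init__(self, target_util: float = 0.7, cooldown_s: float = 300.0):
        self.target_util = target_util
        self.cooldown_s = cooldown_s
        self.replicas = 4
        self._last_scale_up = float("-inf")

    def observe(self, utilization: float, now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        in_cooldown = (now - self._last_scale_up) < self.cooldown_s
        if utilization > self.target_util and not in_cooldown:
            self.replicas += 1
            self._last_scale_up = now
        elif utilization < self.target_util * 0.5 and not in_cooldown:
            # Also hold off scale-down during warm-up to avoid flapping.
            self.replicas = max(1, self.replicas - 1)
        return self.replicas
```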

A database schema change added an index on a large table without using the CONCURRENTLY option, locking the table. This reminds me of a similar incident from when I worked for Honeycomb, and the solution they came up with.

  Ray Chen — Railway
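If you’re on Postgres, the non-locking variant comes with a gotcha of its own: CREATE INDEX CONCURRENTLY can’t run inside a transaction block, so with psycopg2 you need autocommit. A quick sketch (the DSN, table, and index names are placeholders, and this isn’t Railway’s actual migration):

```python
import psycopg2

# Placeholders: swap in your own DSN, table, and column.
conn = psycopg2.connect("dbname=app")
conn.autocommit = True  # CONCURRENTLY refuses to run inside a transaction block

with conn.cursor() as cur:
    # Blocking form -- takes a lock that stalls writes on a large, busy table:
    #   CREATE INDEX idx_events_account ON events (account_id);
    # Non-blocking form -- slower to build, but writes keep flowing:
    cur.execute(
        "CREATE INDEX CONCURRENTLY idx_events_account ON events (account_id)"
    )
```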

Oof, that’s a harsh title, but this is a great discussion of how we strive to design for reliability even when our downstream vendors have outages.

  Uwe Friedrichsen

This one has a lot of good recommendations for staff-level SREs covering 8 areas, shared by a former Staff SRE.

  Karan Nagarajagowda

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses were blocked by synchronous log writes to a busy disk.

  Nataraj Mocherla — DZone

This air accident report video by Mentour Pilot has a great example of alert fatigue around 30 minutes in. The air traffic controllers received enough spurious conflict alerts every day that they became easy to ignore.

  Mentour Pilot

In this post you learn:
* What are emergent properties and what kind of system has them?
* What are weak and strong emergence as opposed to resultant properties?
* How do emergent properties impact the reliability, maintainability, predictability, and cost of the system?

Well worth a read. It really got me thinking about emergence and its relationship to reliability.

  Alex Ewerlöf

In an incident, it’s important to have someone be in charge — and for it to be clear who that is, as explained in this article.

  Joe Mckevitt — Uptime Labs

SRE Weekly Issue #499

The folks at Uptime Labs and Adaptive Capacity Labs have announced an advent calendar for this December.

Note: In order to take part, you’ll need to provide an email address to subscribe. I gave that some serious thought before including this here, but ultimately, I have a lot of trust in the folks at both ACL and Uptime Labs, since they’ve both produced so much awesome content that’s been featured here. I’m interested to see what this collab will bring!

  Uptime Labs and Adaptive Capacity Labs

Cool trick: divide short-term P95 latency by the long-term P95 to detect load spikes and adjust rate limits on-the-fly.

  Shravan Gaonkar — Airbnb
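The numbers below are made up rather than Airbnb’s, but here’s roughly what that trick looks like in code: keep a short window and a long window of latencies, compare their P95s, and shrink the limit when the ratio spikes.

```python
from collections import deque
from statistics import quantiles

class AdaptiveLimiter:
    """Compare short-window P95 latency to long-window P95; if the ratio
    spikes, the service is under load, so shrink the rate limit."""

    def __init__(self, base_limit: int = 1000):
        self.short = deque(maxlen=200)   # recent requests
        self.long = deque(maxlen=5000)   # steady-state baseline
        self.base_limit = base_limit

    @staticmethod
    def _p95(samples) -> float:
        return quantiles(samples, n=20)[-1]  # 95th percentile cut point

    def record(self, latency_ms: float) -> int:
        self.short.append(latency_ms)
        self.long.append(latency_ms)
        if len(self.long) < 100:
            return self.base_limit           # not enough data yet
        ratio = self._p95(self.short) / self._p95(self.long)
        if ratio > 1.5:                      # short-term P95 well above baseline
            return int(self.base_limit / ratio)  # tighten proportionally
        return self.base_limit
```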

Datadog shares the bigger-picture lessons they learned and improvements they made since their major 2023 outage, including an emphasis on graceful degradation.

  Laura de Vesine, Rob Thomas, and Maciej Kowalewski

This article does a really good job of laying out the problems with serverless that led them to leave: having to layer on significant complexity to deal with the limits of running in Cloudflare workers.

  Andreas Thomas — Unkey

This article explains the two concepts of reliability and fault tolerance and how they relate.

  Oakley Hall

This one could easily be titled, “Today, major system failures meant that I was able to take down production really easily.” There’s some great discussion in the comments, and I hope the author feels better.

  u/Deep-Jellyfish-2383 and others — reddit

Slack shows how they reworked the way they deploy changes to their monolithic Chef cookbook to reduce risk, breaking production up into 6 separate environments.

  Archie Gunasekara — Slack

The author discusses reasons why engineer attrition won’t appear in a public incident write-up, and may well not appear in a private one, either.

  Lorin Hochstein

SRE Weekly Issue #498

A message from our sponsor, Costory:

You didn’t sign up to do FinOps. Costory automatically explains why your cloud costs change, and reports it straight to Slack. Built for SREs who want to code, not wrestle with spreadsheets. Now on AWS & GCP Marketplaces.

Start your free trial at costory.io

Cloudflare had a major incident this week, and they say it was their worst since 2019. In this report, they explain what happened, and the failure mode is pretty interesting.

  Matthew Prince — Cloudflare

How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.

They cover not just the motivation for and improvements in V2, but also the migration process to deploy V2 without interruption.

  Shravan Gaonkar — Airbnb

Netflix’s WAL service acts as a go-between, streaming data to pluggable targets while providing extra functionality like retries, delayed sending, and a dead-letter queue.

  Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu — Netflix
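This isn’t Netflix’s API, but as a sketch of the shape they describe, imagine a relay that accepts a write, retries delivery to a pluggable target, and parks anything that still fails on a dead-letter queue for later replay:

```python
from dataclasses import dataclass, field
from typing import Protocol

class Target(Protocol):
    def deliver(self, event: bytes) -> None: ...

@dataclass
class WalRelay:
    """Toy relay: try the pluggable target a bounded number of times,
    and never drop an event -- failures land on a dead-letter queue."""
    target: Target
    max_attempts: int = 3
    dead_letters: list[bytes] = field(default_factory=list)

    def append(self, event: bytes) -> None:
        for _attempt in range(self.max_attempts):
            try:
                self.target.deliver(event)
                return
            except Exception:
                continue
        self.dead_letters.append(event)  # replay these once the target recovers
```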

A (very) deep dive into Datadog’s custom data store, with special attention to how it handles query planning and optimization.

  Sami Tabet — Datadog

Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use it to learn as much as possible about the work involved in diagnosing and remediating incidents in your company.

  Lorin Hochstein

we landed on a two-level failure capture design that combines Kafka topics with an S3 backup to ensure no event is ever lost.

  Tanya Fesenko, Collin Crowell, Dmitry Mamyrin, and Chinmay Sawaji — Klaviyo
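The article covers the real implementation; as a tiny sketch of the two-level idea (the callables below stand in for a real Kafka producer and S3 client), the fast path is the Kafka produce and the fallback writes the event somewhere durable for later replay:

```python
from typing import Callable

def capture(event: bytes,
            publish_to_kafka: Callable[[bytes], None],
            backup_to_s3: Callable[[bytes], None]) -> None:
    """Two-level capture: fast path is a Kafka produce; if that fails,
    fall back to durable object storage so the event can be replayed."""
    try:
        publish_to_kafka(event)
    except Exception:
        backup_to_s3(event)

def flaky_kafka_produce(event: bytes) -> None:
    raise RuntimeError("broker unavailable")  # simulate a Kafka outage

capture(b'{"type": "signup"}',
        publish_to_kafka=flaky_kafka_produce,
        backup_to_s3=lambda e: print("backed up", len(e), "bytes to S3"))
```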

Buried in this one is this gem: the last layer of reliability is that their client library automatically retries to alternate regions if the main region fails.

  Paddy Byers — Ably
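The details are Ably’s, but the general pattern is easy to picture. Here’s a hedged sketch (region names and the send callable are placeholders, not their client library) of a client that falls back to alternate regions when the primary fails:

```python
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-2"]  # illustrative names

def publish(message: str, send_to_region) -> str:
    """Try the primary region first, then fail over to the alternates --
    the 'last layer' idea from the blurb, with a callable standing in
    for a real per-region client."""
    last_error = None
    for region in REGIONS:
        try:
            send_to_region(region, message)
            return region                 # report which region accepted it
        except Exception as err:
            last_error = err              # remember why, keep failing over
    raise RuntimeError("all regions failed") from last_error
```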

incident.io shares details on how they fared during the AWS us-east-1 incident on October 20.

  Pete Hamilton — incident.io

A production of Tinker Tinker Tinker, LLC