SRE Weekly Issue #500

A message from our sponsor, Depot:

Stop hunting through GitHub Actions logs. Depot now offers powerful CI log search across all your repositories and workflows. With smart filtering by timeframe, runner type, and keywords, you’ll have all the information at your fingertips to debug faster.

Wow, five hundred issues! I sent the first issue of SRE Weekly out almost exactly ten years ago. I assumed my little experiment would fairly quickly come to an end when I exhausted the supply of SRE-related articles.

I needn’t have worried. Somehow, the authors I’ve featured here have continued to produce a seemingly endless stream of excellent articles. If anything, the pace has only picked up over time! A profound thank you to all of the authors, without whom this newsletter would be just an empty bulleted list.

And thanks to you, dear readers, for making this worthwhile. Thanks for sharing the articles you find or write; I love receiving them! Thanks for the notes you send after an issue you particularly like, and the corrections too. Thanks for your kind well-wishes for my recent surgery; they meant a ton.

Finally, thanks to my sponsors, whose support makes all this possible. If you see something interesting, please give it a click and check it out!

When a scale-up event actually causes increased resource usage for a while, a standard auto-scaling algorithm can fail.

  Minh Nhat Nguyen, Shi Kai Ng, and Calvin Tran — Grab

A database schema change added an index on a large table without using the CONCURRENTLY option, locking the table. This reminds me of a similar incident from when I worked at Honeycomb, and of how they solved it.

  Ray Chen — Railway
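
For anyone who hasn't run into the CONCURRENTLY option mentioned above, here is a minimal sketch of the difference. It is not taken from the Railway write-up, and the function, index, table, and column names are all hypothetical. It only builds the statement text. In PostgreSQL, a plain CREATE INDEX blocks writes to the table while the index builds, whereas CREATE INDEX CONCURRENTLY avoids that at the cost of a slower build, and it cannot be run inside a transaction block.

```python
import re

# Conservative pattern for unquoted SQL identifiers (an assumption made for
# this sketch; real schemas may need quoted names).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_create_index(index, table, columns, concurrently=True):
    """Return a PostgreSQL CREATE INDEX statement as a string.

    Hypothetical helper for illustration only. With concurrently=True the
    statement uses CREATE INDEX CONCURRENTLY, which does not block writes
    to the table while the index is being built.
    """
    for name in (index, table, *columns):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    keyword = "CREATE INDEX CONCURRENTLY" if concurrently else "CREATE INDEX"
    return f"{keyword} {index} ON {table} ({', '.join(columns)});"


if __name__ == "__main__":
    # Hypothetical names: an "events" table indexed on "created_at".
    print(build_create_index("idx_events_created_at", "events", ["created_at"]))
    print(build_create_index("idx_events_created_at", "events", ["created_at"],
                             concurrently=False))
```

A migration tool would need to run the concurrent form outside a transaction, and a failed concurrent build leaves behind an invalid index that has to be dropped before retrying.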

Oof, that’s a harsh title, but this is a great discussion of how we strive to design for reliability even when our downstream vendors have outages.

  Uwe Friedrichsen

This one has a lot of good recommendations for staff-level SREs covering 8 areas, shared by a former Staff SRE.

  Karan Nagarajagowda

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses were blocked by synchronous log writes to a busy disk.

  Nataraj Mocherla — DZone

This air accident report video by Mentour Pilot has a great example of alert fatigue around 30 minutes in. The air traffic controllers received enough spurious conflict alerts every day that they became easy to ignore.

  Mentour Pilot

In this post you learn:
* What are emergent properties and what kind of system has them?
* What are weak and strong emergence as opposed to resultant properties?
* How do emergent properties impact the reliability, maintainability, predictability, and cost of the system?

Well worth a read. It really got me thinking about emergence and its relationship to reliability.

  Alex Ewerlöf

In an incident, it’s important to have someone be in charge — and for it to be clear who that is, as explained in this article.

  Joe Mckevitt — Uptime Labs

Updated: December 7, 2025 — 9:05 pm
A production of Tinker Tinker Tinker, LLC