SRE Weekly Issue #500


Wow, five hundred issues! I sent the first issue of SRE Weekly out almost exactly ten years ago. I assumed my little experiment would fairly quickly come to an end when I exhausted the supply of SRE-related articles.

I needn’t have worried. Somehow, the authors I’ve featured here have continued to produce a seemingly endless stream of excellent articles. If anything, the pace has only picked up over time! A profound thank you to all of the authors, without whom this newsletter would be just an empty bulleted list.

And thanks to you, dear readers, for making this worthwhile. Thanks for sharing the articles you find or write; I love receiving them! Thanks for the notes you send after an issue you particularly like, and the corrections too. Thanks for your kind well-wishes for my recent surgery; they meant a ton.

Finally, thanks to my sponsors, whose support makes all this possible. If you see something interesting, please give it a click and check it out!

Machine-learning predictive autoscaling for Flink

When a scale-up event itself causes increased resource usage for a while, a standard auto-scaling algorithm can fail.

Minh Nhat Nguyen, Shi Kai Ng, and Calvin Tran — Grab

[Railway] Incident Report: October 28th, 2025

A database schema change added an index on a large table without using the CONCURRENTLY option, locking the table. This reminds me of a similar incident from when I worked at Honeycomb, and of the solution they came up with.

Ray Chen — Railway
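
One way to catch this class of mistake before it ships is a lint step over migration files. The following is only a minimal sketch, not anything described in the Railway report: it assumes PostgreSQL-style syntax, the function name `find_blocking_index_statements` is hypothetical, and the statement splitting is deliberately naive (it does not handle semicolons inside string literals or SQL comments).

```python
import re

# Matches CREATE [UNIQUE] INDEX and captures the CONCURRENTLY keyword if present.
_CREATE_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\s+(CONCURRENTLY\b)?",
    re.IGNORECASE,
)


def find_blocking_index_statements(migration_sql: str) -> list[str]:
    """Return the CREATE INDEX statements that omit CONCURRENTLY.

    Hypothetical helper for illustration only: it splits on ';' naively.
    """
    flagged = []
    for statement in migration_sql.split(";"):
        statement = statement.strip()
        match = _CREATE_INDEX.search(statement)
        # group(1) is None when the CONCURRENTLY keyword was not found.
        if match and match.group(1) is None:
            flagged.append(statement)
    return flagged


if __name__ == "__main__":
    migration = """
        CREATE INDEX idx_events_user ON events (user_id);
        CREATE INDEX CONCURRENTLY idx_events_time ON events (created_at);
    """
    for stmt in find_blocking_index_statements(migration):
        print("missing CONCURRENTLY:", stmt)
```

A check like this would need an escape hatch for small or brand-new tables, where a plain index build is harmless. Note also that in PostgreSQL, CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so migration tools that wrap each migration in a transaction need that wrapping disabled for such statements.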

It is your fault if your application is down

Oof, that’s a harsh title, but this is a great discussion of how we strive to design for reliability even when our downstream vendors have outages.

Uwe Friedrichsen

Advice for First-Time Staff SREs

This one has a lot of good recommendations for staff-level SREs covering 8 areas, shared by a former Staff SRE.

Karan Nagarajagowda

The JVM Pause That Wasn’t: A War Story

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses were blocked by synchronous log writes to a busy disk.

Nataraj Mocherla — DZone

The Tragedy of PSA Flight 182

This air accident report video by Mentour Pilot has a great example of alert fatigue around 30 minutes in. The air traffic controllers received enough spurious conflict alerts every day that they became easy to ignore.

Mentour Pilot

Emergent properties

In this post you learn:
* What are emergent properties and what kind of system has them?
* What are weak and strong emergence as opposed to resultant properties?
* How do emergent properties impact the reliability, maintainability, predictability, and cost of the system?

Well worth a read. It really got me thinking about emergence and its relationship to reliability.

Alex Ewerlöf

Who’s in Charge?

In an incident, it’s important to have someone be in charge — and for it to be clear who that is, as explained in this article.

Joe Mckevitt — Uptime Labs


Subscribe

RSS

Mastodon

Search Issues