SRE Weekly Issue #517

A message from our sponsor, BigPanda:

No single team sees the full incident anymore.

Today’s P1s break across services, teams, and infrastructure. Instead of chasing dashboards, waiting on tribal knowledge, or piecing together signals from siloed systems, BigPanda surfaces the complete picture to pinpoint root cause faster.

See BigPanda for SREs

There’s some great advice in here. My favorite: be explicit about choosing or not choosing to do something.

  incident.io

Live video delivery is an intensely reliability-critical endeavor, and Netflix pulls back the curtain on how they tackled it.

  Brett Axler, Casper Choffat, and Alo Lowry — Netflix

Java uses memory outside of the heap, so it can OOM in a container even if the heap size is well below the container’s memory limit.

  Ramya vani Rayala — DZone
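
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch (mine, not taken from the article). The function name and every number are hypothetical; real non-heap usage varies by JVM version, flags, and workload, so treat this as an illustration of the budgeting idea rather than a sizing formula.

```python
def estimate_jvm_footprint_mib(heap_mib, metaspace_mib, thread_count,
                               stack_mib_per_thread, direct_buffers_mib,
                               code_cache_mib):
    """Rough estimate of total JVM process memory in MiB.

    Hypothetical helper: it simply adds the heap to a few well-known
    non-heap areas (metaspace, thread stacks, direct buffers, code cache).
    """
    non_heap_mib = (
        metaspace_mib
        + thread_count * stack_mib_per_thread
        + direct_buffers_mib
        + code_cache_mib
    )
    return heap_mib + non_heap_mib


# Illustrative values only (assumptions, not measurements).
container_limit_mib = 1024
footprint_mib = estimate_jvm_footprint_mib(
    heap_mib=768,            # heap capped at 75% of the container limit
    metaspace_mib=128,
    thread_count=200,
    stack_mib_per_thread=1,
    direct_buffers_mib=64,
    code_cache_mib=48,
)

print(f"estimated footprint: {footprint_mib} MiB, limit: {container_limit_mib} MiB")
if footprint_mib > container_limit_mib:
    print("over the container limit even though the heap alone fits")
```

With these made-up inputs the heap sits comfortably under the limit, but the total does not, which is the failure mode the article describes.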

It’s not about the obviously wrong stuff; it’s the queries that look good on the surface that can get you in trouble, per this article. The authors also share methods to vet LLM-generated SQL.

  Readyset

The mental model we use: AI handles the effort so humans can focus on the insight. Not AI instead of thinking.

  incident.io

[…] because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect.

  Peter Farago — RunLLM

The old way: heroes in incident response are an anti-pattern.
The new way: heroes are great and we should make as many heroes as we can.

  Hamed Silatani — Uptime Labs

I had to read this one twice before I had my galaxy-brain moment in the second-to-last paragraph.

  Lorin Hochstein

Updated: May 17, 2026 — 10:29 pm
A production of Tinker Tinker Tinker, LLC