SRE Weekly Issue #517

There’s some great advice in here. My favorite: be explicit about choosing or not choosing to do something.

incident.io

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Live video delivery is an intensely reliability-critical endeavor, and Netflix pulls back the curtain on how they tackled it.

Brett Axler, Casper Choffat, and Alo Lowry — Netflix

The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes

Java uses memory outside of the heap, so it can OOM in a container even if the heap size is well below the container’s memory limit.

Ramya vani Rayala — DZone
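
To make the arithmetic concrete, here is a rough sketch of how off-heap regions can push a JVM past its container limit. Every number is hypothetical, the function name is invented for illustration, and the 1 MiB-per-thread stack size is an assumption; none of it comes from the linked article.

```python
# Hypothetical figures in MiB. Real values depend on the JVM version,
# its flags, and the workload.
def estimated_jvm_footprint_mib(heap_max, metaspace, thread_count,
                                stack_per_thread=1, direct_buffers=0,
                                code_cache=0):
    """Sum the heap plus the main off-heap regions a JVM process can occupy."""
    thread_stacks = thread_count * stack_per_thread
    return heap_max + metaspace + thread_stacks + direct_buffers + code_cache


container_limit = 1024  # assumed container memory limit
heap_max = 768          # assumed max heap, well under the limit

footprint = estimated_jvm_footprint_mib(
    heap_max=heap_max,
    metaspace=128,
    thread_count=200,
    direct_buffers=64,
    code_cache=48,
)

# The heap alone fits, but the whole process does not.
print(f"heap fits: {heap_max < container_limit}")
print(f"estimated footprint: {footprint} MiB, over limit: {footprint > container_limit}")
```

With these made-up inputs the heap sits comfortably under the limit while the estimated process footprint exceeds it, which is the situation the article describes.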

Why LLMs Write Incorrect SQL (and What That Means for Your Database)

It’s not about obviously wrong stuff; it’s the queries that look good on the surface that can get you in trouble, per this article. They also share methods to vet LLM-generated SQL.

Readyset

What does using AI for post-mortems actually mean?

The mental model we use: AI handles the effort so humans can focus on the insight. Not AI instead of thinking.

incident.io

The Code Nobody Read Is Already in Production

[…] because AI tools continue to make it cheaper to write (and rewrite) code on demand, production environments will become the primary place to evaluate whether software is correct or incorrect.

Peter Farago — RunLLM

The Incident Hero Trap

The old way: heroes in incident response are an anti-pattern.
The new way: heroes are great and we should make as many heroes as we can.

Hamed Silatani — Uptime Labs

How incidents can teach us about what’s already working well

I had to read this one twice before I had my galaxy-brain moment in the second-to-last paragraph.

Lorin Hochstein
