SRE Weekly Issue #477

Why don’t we look for the root cause of a successful outcome?

  Hamed Silatani — Uptime Labs

They took a great deal of care to avoid the potential pitfalls of using an LLM in this way, and they share a lot of detail about the steps they took.

  Tran Le, Till Pieper, and Gillian McGarvey — Datadog

After dealing with a late-night outage with surprisingly small impact, I got thinking about how you would know if you were working too hard to guarantee uptime.

  Tom Elliott

In this article, learn how the 4 R’s — robust architecture, resumability, recoverability, and redundancy — enhance reliability in AI and ML data pipelines.

  Sidhant Bendre — DZone

In this article, I’ll delve into the challenges we encountered and the strategies we employed to manage operator upgrades for stateful workloads like Elasticsearch. Additionally, I’ll detail how we modified the ECK [Elastic Cloud on Kubernetes] operator to facilitate a more resilient side-by-side upgrade process.

  Abhishek Munagekar — Mercari

In this piece, I’ll delve into four macro challenges facing observability today, explore strategies that are emerging across the industry to address them, and offer my perspective on the trajectory of this crucial domain in the year to come.

  Andrew Mallaband

A deep-dive into a pretty nifty system for enumerating and provisioning a rack of servers, complete with PXE-based Debian headless installation using an auto-generated preseed file. It also uses Claude to figure out what state a server is in from a screenshot obtained from the BMC.

  Charith Amarasinghe — Railway

Koreo is a new open source tool for orchestrating Kubernetes infrastructure at a higher level than standard tools like Helm.

Koreo is a fairly complex tool, so it can be difficult to quickly grasp just what exactly it is, what problems it’s designed to solve, and how it compares to other, similar tools. In this post, I want to dive into these topics and also discuss the original motivation behind Koreo.

  Tyler Treat

This one is about understanding how work actually happens in our sociotechnical systems (versus how we imagine it). This has implications for how we learn from incidents and how we design corrective actions.

  Lorin Hochstein

Updated: May 18, 2025 — 9:33 pm
A production of Tinker Tinker Tinker, LLC