SRE Weekly Issue #489

A message from our sponsor, Observe, Inc.:

Observe’s free Masterclass in Observability at Scale is coming on September 4th at 10am Pacific! We’ll explore how to architect for observability at scale – from streaming telemetry and open data lakes to AI agents that proactively instrument your code and surface insights.

Learn more and register today!

As we learn advanced resilience engineering concepts, this article recommends that we take a balanced approach in how we try to change existing practices.

I can confidently say that when an executive leader wants to be talking about quality of service for your customers, the last thing they want to hear about is academic papers and Monte Carlo simulations.

  Michelle Casey — Resilience in Software Foundation

You probably know all about how hashing works, but this one’s still worth a read. The article includes interactive demonstrations and clearly presents the concepts you need to understand how hash function performance is evaluated.

  Sam Rose

Pulled from the Internet Archive, here’s a story of how the now-defunct Parse rewrote their Ruby on Rails API in Golang, significantly improving reliability.

  Charity Majors

We are sharing methodologies we deploy at various scales for detecting SDC [Silent Data Corruption] across our AI and non-AI infrastructure to help ensure the reliability of AI training and inference workloads across Meta.

  Harish Dattatraya Dixit and Sriram Sankar — Meta

As monday.com broke their monolith up into microservices, their number of databases grew too. To have a chance of managing all of them, they shifted from traditional DBA practices to database reliability engineering (DBRE).

  Mateusz Wojciechowski — monday.com

Airbnb runs a large-scale database on Kubernetes. They have various techniques to deal with the ephemerality of pods and the risks inherent in cluster upgrades.

  Artem Danilov — Airbnb

The author of this article brings us along as they do a very thorough evaluation of K8sGPT, showing us what it can do and some ways in which it can fall short.

  Evgeny Torin — Palark

What is good incident communication? This article draws on theory from Herbert Clark’s Joint Action Ladder to help us evaluate and strengthen communication.

  Stuart Rimell — Uptime Labs

Updated: August 10, 2025 — 10:21 pm
A production of Tinker Tinker Tinker, LLC