SRE Weekly Issue #489

Negotiating the Paradox We Face in Resilience Engineering—Lessons From an Engineering Leader

As we learn advanced resilience engineering concepts, this article recommends that we take a balanced approach in how we try to change existing practices.

I can confidently say that when an executive leader wants to be talking about quality of service for your customers, the last thing they want to hear about is academic papers and Monte Carlo simulations.

Michelle Casey — Resilience in Software Foundation

Hashing

I know you probably know all about how hashing works, but this one’s still worth a read. The article includes interactive demonstrations and clearly presents concepts to help you understand how hash function performance is evaluated.

Sam Rose

How We Migrated the Parse API From Ruby to Golang (Resurrected)

Pulled from the Internet Archive, here’s a story of how the now-defunct Parse rewrote their Ruby on Rails API in Golang, significantly improving reliability.

Charity Majors

How Meta keeps its AI hardware reliable

We are sharing methodologies we deploy at various scales for detecting SDC [Silent Data Corruption] across our AI and non-AI infrastructure to help ensure the reliability of AI training and inference workloads across Meta.

Harish Dattatraya Dixit and Sriram Sankar — Meta

Guarding the herd – managing database servers at scale

As monday.com broke their monolith up into microservices, their number of databases expanded too. To have a chance of managing all of them, they shifted from DBA practices to DBRE.

Mateusz Wojciechowski — monday.com

Achieving High Availability with distributed database on Kubernetes at Airbnb

Airbnb runs a large-scale database on Kubernetes. They have various techniques to deal with the ephemerality of pods and the risks inherent in cluster upgrades.

Artem Danilov — Airbnb

K8sGPT for Kubernetes troubleshooting: How AI helps in different cases

The author of this article brings us along as they do a very thorough evaluation of K8sGPT, showing us what it can do and some ways in which it can fall short.

Evgeny Torin — Palark

Climbing the Communication Ladder During Incidents

What is good incident communication? This article draws on theory from Herbert Clark’s Joint Action Ladder to help us evaluate and strengthen communication.

Stuart Rimell — Uptime Labs
