SRE Weekly Issue #457

A message from our sponsor, FireHydrant:

This New Year, resolve to make incident management smarter, faster, and way less stressful with FireHydrant. Modern on-call, automated incident response, and AI tools that do the heavy lifting.

https://firehydrant.com/

In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.

  Will Searle — Causely

The high plateau of basic resilience is the third interim stop companies tend to reach on their journey towards resilience.

I especially enjoyed the bit about how trying to add robustness can paradoxically diminish overall reliability, reminiscent of Lorin Hochstein and others.

  Uwe Friedrichsen

What happens when you move your DB and network latency goes from 0.5ms to 10ms? Time to find out by experimenting (carefully).

  Lawrence Jones

I’ve only used Kubernetes under Amazon EKS, which handles running etcd, so this guide helped fill in some gaps in my knowledge. Of course, under EKS, you still need to pay attention to etcd.

  David M. Lentz — Datadog

Google folks share how they’ve applied System-Theoretic Accident Model and Processes (STAMP) to SRE. This really stood out to me:

A design might implement its requirements flawlessly. But what if requirements necessary for the system to be safe were incorrect or, even worse, missing altogether? 

  Tim Falzone and Ben Treynor Sloss — USENIX ;login:

Search and rescue (SAR) operations and incident response have striking similarities. In this series, Claire dives into lessons SREs can learn from the Incident Command System (ICS) as used in wildfire management.

I really love learning about ICS from the veterans who use it for actual emergencies!

  Claire Leverne — Rootly

Runbooks are programs for an imperfect execution engine of highly variable quality.

What happens when the runbook meets reality?

  Jos Visser

This is a really great one! Several factors combined to cause the outage, and they’re all laid out in juicy detail.

  Brendan Humphreys — Canva

Here’s Lorin Hochstein’s take on Canva’s outage report.

  Lorin Hochstein

Updated: December 29, 2024 — 9:32 pm
A production of Tinker Tinker Tinker, LLC