In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.
Will Searle — Causely
The high plateau of basic resilience is the third interim stop companies tend to reach on their journey towards resilience.
I especially enjoyed the bit about how trying to add robustness can paradoxically diminish overall reliability, reminiscent of writing by Lorin Hochstein and others.
Uwe Friedrichsen
What happens when you move your DB and network latency goes from 0.5ms to 10ms? Time to find out by experimenting (carefully).
Lawrence Jones
I’ve only used Kubernetes under Amazon EKS, which handles running etcd, so this guide helped fill in some gaps in my knowledge. Of course, under EKS, you still need to pay attention to etcd.
David M. Lentz — Datadog
Google folks share how they’ve applied System-Theoretic Accident Model and Processes (STAMP) to SRE. This really stood out to me:
A design might implement its requirements flawlessly. But what if requirements necessary for the system to be safe were incorrect or, even worse, missing altogether?
Tim Falzone and Ben Treynor Sloss — USENIX ;login:
Search and rescue (SAR) operations and incident response have striking similarities. In this series, Claire dives into lessons SREs can learn from wildfire management ICSs.
I really love learning about ICS from the veterans who use it for actual emergencies!
Claire Leverne — Rootly
Runbooks are programs for an imperfect execution engine of highly variable quality.
What happens when the runbook meets reality?
Jos Visser
This is a really great one! Several factors combined to cause the outage, and they’re all laid out in juicy detail.
Brendan Humphreys — Canva
Here’s Lorin Hochstein’s take on Canva’s outage report.
Lorin Hochstein