[…] when Kubernetes is involved, the number of alert sources can skyrocket quickly. This article will reflect on some common causes of alert fatigue and share tips to help reduce it.
Nate Matherson — DZone
Meta has a special system to warn servers about power outages, giving them 45 seconds of battery power to finish things up and get ready to shut down.
Raghunathan Modoor Jagannathan, Sulav Malla, and Parimala Kondety — Meta
This is an approachable explanation of the Paxos algorithm with examples, diagrams, and code.
But what does reliability mean for people outside of engineering? And how does it translate into best practices for other teams?
Emily Arnott — Blameless
“The Practice of Practice” is a concept from improvisational music. This article artfully applies the idea to the practice of incident response.
Matt Davis — Blameless
I haven’t heard of this technique being used before, assigning alerts to on-call folks in round-robin order as they come in. I wonder if there’s a reason for that…
Hannah Culver — PagerDuty
Raise your hand if you’ve been bitten by DNS before.