Articles
This article teaches the math you need to build alerting with a low false positive rate, and explains why this is trickier than it may seem.
Dan Slimmon
Cloudflare adapted a technique shared by PagerDuty to detect failed Kafka consumers and restart them.
Chris Shepherd and Andrea Medda - Cloudflare
Gracefully shutting down is important, otherwise every deploy will result in client-facing errors.
Srinavas - eightnoteight
There’s a wealth of lessons learned in this article. My favorite: idempotency was never part of the contract, but consumers nevertheless depended on it.
Gergely Orosz - Pragmatic Engineer
Making our companies into High Reliability Organizations (HROs) rarely makes sense, but we can still learn useful skills and techniques from them. This article gives a good overview and analysis of HROs.
Dominic Cooper - Safety & Health Practitioner
By “tiered”, this article means having discussions about reliability at three levels: the engineering team level, the director level, and the executive level.
Ross Brodbeck
After explaining why deploys aren’t the right approach, this article proposes feature flags as a safer alternative.
Charity Majors - Honeycomb
Full disclosure: Honeycomb is my employer.