Reading this article will teach you the math you need to know to build alerting that has a low false positive rate and why this is trickier than it may seem.
Cloudflare adapted a technique shared by PagerDuty to detect failed Kafka consumers and restart them.
Chris Shepherd and Andrea Medda — Cloudflare
Gracefully shutting down is important, otherwise every deploy will result in client-facing errors.
Srinavas — eightnoteight
There’s a wealth of lessons learned in this article. My favorite: idempotency was never part of the contract, but consumers nevertheless depended on it.
Gergely Orosz — Pragmatic Engineer
Making our companies into High Reliability Organizations (HROs) rarely makes sense, but we can still learn useful skills and techniques from them. This article gives a good overview and analysis of HROs.
Dominic Cooper — Safety & Health Practitioner
By “tiered”, this article means having discussions about reliability at three levels: the engineering team level, the director level, and the executive level.
After explaining why deploys aren’t the right approach, this article proposes feature flags as a safer approach.
Charity Majors — Honeycomb
Full disclosure: Honeycomb is my employer.