SRE Weekly Issue #367

Articles

Car alarms and smoke alarms: the tradeoff between sensitivity and specificity

Reading this article will teach you the math you need to build alerting with a low false positive rate, and show you why that is trickier than it may seem.

Dan Slimmon
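
As a rough illustration of why a low false positive rate is hard to achieve (this sketch is not taken from the article, and the numbers are hypothetical assumptions), Bayes' rule gives the share of firing alerts that reflect a real problem. The function name and the 99% / 0.1% figures below are made up for the example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a firing alert reflects a real problem (Bayes' rule).

    sensitivity: P(alert fires | real problem)
    specificity: P(alert silent | no problem)
    prevalence:  P(real problem) at any given check
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical numbers: the check is 99% sensitive and 99% specific,
# but the service is actually unhealthy only 0.1% of the time.
ppv = positive_predictive_value(0.99, 0.99, 0.001)
print(f"{ppv:.1%} of alerts are real")  # about 9%
```

Under these assumed numbers, roughly nine in ten alerts are false alarms even though the check looks 99% accurate, because healthy checks vastly outnumber unhealthy ones.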

Intelligent, automatic restarts for unhealthy Kafka consumers

Cloudflare adapted a technique shared by PagerDuty to detect failed Kafka consumers and restart them.

Chris Shepherd and Andrea Medda — Cloudflare

Graceful Shutdown

Shutting down gracefully is important; otherwise, every deploy will result in client-facing errors.

Srinavas — eightnoteight

Distributed systems learnings in 2019

There’s a wealth of lessons learned in this article. My favorite: idempotency was never part of the contract, but consumers nevertheless depended on it.

Gergely Orosz — Pragmatic Engineer

Great expectations: Are high-reliability organisations perfect?

Making our companies into High Reliability Organizations (HROs) rarely makes sense, but we can still learn useful skills and techniques from them. This article gives a good overview and analysis of HROs.

Dominic Cooper — Safety & Health Practitioner

Tiered Availability Review

By “tiered”, this article means having discussions about reliability at three levels: the engineering team level, the director level, and the executive level.

Ross Brodbeck

Deploys Are the ✨WRONG✨ Way to Change User Experience

After explaining why deploys aren’t the right approach, this article proposes feature flags as a safer alternative.

Charity Majors — Honeycomb
Full disclosure: Honeycomb is my employer.

