SRE Weekly Issue #367

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?


This article teaches the math you need to build alerting with a low false positive rate, and explains why this is trickier than it may seem.

  Dan Slimmon

Cloudflare adapted a technique shared by PagerDuty to detect failed Kafka consumers and restart them.

  Chris Shepherd and Andrea Medda — Cloudflare

Shutting down gracefully is important; otherwise, every deploy will result in client-facing errors.

  Srinavas — eightnoteight

There’s a wealth of lessons learned in this article. My favorite: idempotency was never part of the contract, but consumers nevertheless depended on it.

  Gergely Orosz — Pragmatic Engineer

Making our companies into High Reliability Organizations (HROs) rarely makes sense, but we can still learn useful skills and techniques from them. This article gives a good overview and analysis of HROs.

  Dominic Cooper — Safety & Health Practitioner

By “tiered”, this article means having discussions about reliability at three levels: the engineering team level, the director level, and the executive level.

  Ross Brodbeck

After explaining why deploys aren’t the right approach, this article proposes feature flags as a safer approach.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

Updated: April 9, 2023 — 8:47 pm
A production of Tinker Tinker Tinker, LLC