SRE Weekly Issue #470

A message from our sponsor, incident.io:

Intercom migrated hundreds of engineers from PagerDuty and Atlassian Status Page to incident.io in just weeks, improving resolution times, simplifying incident management, and delivering a better customer support experience. Watch the video case study.

https://go.incident.io/customers/intercom

An SRE thinks about the meaning of “sociotechnical”:

From an SRE perspective, it means that when we’re looking at a piece of software, we can’t just factor out the human decisions that happen both in its operation and usage, but also in its development.

  Clint Byrum

This one is about the difficulties Nextdoor had with database read replicas, which led developers to mostly just send reads to the primary. They came up with a pretty neat solution that automatically sends read queries to a replica when possible.

In case you missed it, here’s part 1.

  Tushar Singla — Nextdoor

This well-thought-out article starts with a solid critique of Five Whys, illustrated with example scenarios. The author then explains why they prefer open-ended questions.

  Hamed Silatani

Spurred by a conversation with engineers, the author of this article explains what retries, backoff, and jitter can fix, and more importantly, when they won’t help.

  Tejas Ghadge — The New Stack
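
For readers who want a concrete picture of the terms, here is a minimal sketch of capped exponential backoff with full jitter. It is not taken from the linked article. The names `backoff_delays`, `call_with_retries`, and `is_retryable`, and the default values, are all hypothetical choices for illustration.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    for attempt in range(attempts):
        # The ceiling doubles on each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point between 0 and the ceiling so that
        # many clients failing at once do not all retry at the same moment.
        yield rng() * ceiling


def call_with_retries(operation, is_retryable, attempts=5, sleep=time.sleep):
    """Call operation() up to `attempts` times, retrying only retryable errors."""
    delays = backoff_delays(attempts=attempts - 1)
    while True:
        try:
            return operation()
        except Exception as exc:
            delay = next(delays, None)
            # Give up when the retry budget is spent or the error is one
            # that retrying cannot fix (for example, a validation error).
            if delay is None or not is_retryable(exc):
                raise
            sleep(delay)
```

The `is_retryable` hook is where the "when they won't help" part shows up: only errors that are plausibly transient should be retried, and the attempt limit keeps retries from piling more load onto a dependency that is already struggling.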

This is a juicy one, involving a routine credential roll gone bad, resulting in an outage in Cloudflare’s R2 service.

  Phillip Jones — Cloudflare

In this series of posts, we illustrate design considerations for a database system throttler, whose purpose is to keep the database system healthy overall. We discuss choice of metrics, granularity, behavior, impact, prioritization, and other topics.

Part 2 is here and part 3 is here.

  Shlomi Noach — Planetscale

I hadn’t heard the term “lurking variable” before, but I definitely know the concept. This article is a must-read for anyone troubleshooting tricky problems in production, and especially for earlier-career folks developing their skills.

  Teiva Harsanyi — The Coder Cafe

This article gives four strategies for handling database queries that need to join data residing in separate shards.

  Baskar Sikkayan — DZone

Updated: March 30, 2025 — 9:55 pm
A production of Tinker Tinker Tinker, LLC