An SRE thinks about the meaning of “sociotechnical”:
From an SRE perspective, it means that when we’re looking at a piece of software, we can’t just factor out the human decisions that happen both in its operation and usage, but also in its development.
Clint Byrum
This one is about the difficulties Nextdoor had with database read replicas, which led developers to mostly just send reads to the primary. They came up with a pretty neat solution that automatically sends read queries to a replica when possible.
In case you missed it, here’s part 1.
Tushar Singla — Nextdoor
This well-thought-out article starts with a solid critique of Five Whys, illustrated with example scenarios. The author then explains why they prefer open-ended questions.
Hamed Silatani
Spurred by a conversation with engineers, the author of this article explains what retries, backoff, and jitter can fix, and more importantly, when they won’t help.
Tejas Ghadge — The New Stack
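For readers who want a concrete picture of the terms, here is a minimal sketch of a retry loop with capped exponential backoff and "full jitter". It is not taken from the linked article; the names `jittered_delay`, `call_with_retries`, and `is_retryable`, and the default values, are assumptions made for illustration.

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Return a 'full jitter' delay: a random fraction of the capped exponential ceiling."""
    ceiling = min(cap, base * (2 ** attempt))  # exponential backoff, capped
    return rng() * ceiling                     # jitter spreads clients out in time


def call_with_retries(operation, is_retryable, attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); retry only failures that is_retryable() accepts.

    All parameters are hypothetical knobs for this sketch.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            # Give up on errors a retry cannot fix, or when attempts run out.
            if not is_retryable(exc) or attempt == attempts - 1:
                raise
            sleep(jittered_delay(attempt, base, cap, rng))
```

The `is_retryable` check is the important design choice: a retry can only paper over a transient failure, so anything else is re-raised immediately rather than adding more load.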
This is a juicy one, involving a routine credential roll gone bad, resulting in an outage in Cloudflare’s R2 service.
Phillip Jones — Cloudflare
In this series of posts, we illustrate design considerations for a database system throttler, whose purpose is to keep the database system healthy overall. We discuss choice of metrics, granularity, behavior, impact, prioritization, and other topics.
Part 2 is here and part 3 is here.
Shlomi Noach — PlanetScale
I hadn’t heard the term “lurking variable” before, but I definitely know the concept. This article is a must-read for anyone troubleshooting tricky problems in production, and especially for earlier-career folks developing their skills.
Teiva Harsanyi — The Coder Cafe
This article offers four strategies for handling database queries that need to join data residing in separate shards.
Baskar Sikkayan — DZone