SRE Weekly Issue #470

A message from our sponsor, incident.io:

Intercom migrated hundreds of engineers from PagerDuty and Atlassian Status Page to incident.io in just weeks, improving resolution times, simplifying incident management, and delivering a better customer support experience. Watch the video case study.

https://go.incident.io/customers/intercom

An SRE thinks about the meaning of “sociotechnical”:

From an SRE perspective, it means that when we’re looking at a piece of software, we can’t just factor out the human decisions that happen both in its operation and usage, but also in its development.

  Clint Byrum

This one is about the difficulties they had with database read replicas that led to developers mostly just sending reads to the primary. They came up with a pretty neat solution to automatically send read queries to the replica when possible.

In case you missed it, here’s part 1.

  Tushar Singla — Nextdoor

This well-thought-out article starts with a solid critique of Five Whys, illustrated with example scenarios. The author then explains why they prefer open-ended questions.

  Hamed Silatani

Spurred by a conversation with engineers, the author of this article explains what retries, backoff, and jitter can fix, and more importantly, when they won’t help.

  Tejas Ghadge — The New Stack
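
For readers who want a concrete picture of the three ideas, here is a minimal sketch of capped exponential backoff with full jitter. It is not taken from the article; the names TransientError, backoff_ceiling, and call_with_retries, along with the default base and cap values, are assumptions made for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def backoff_ceiling(attempt, base=0.1, cap=5.0):
    """Upper bound in seconds for the sleep after the given 0-based attempt."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(), retrying only on TransientError.

    Each wait is drawn uniformly from [0, ceiling] ("full jitter") so that
    many clients failing at the same moment do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(random.uniform(0, backoff_ceiling(attempt, base, cap)))
```

Any other exception propagates immediately, which reflects the blurb's point: retries only help with failures that are actually transient.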

This is a juicy one, involving a routine credential roll gone bad, resulting in an outage in Cloudflare’s R2 service.

  Phillip Jones — Cloudflare

In this series of posts, we illustrate design considerations for a database system throttler, whose purpose is to keep the database system healthy overall. We discuss choice of metrics, granularity, behavior, impact, prioritization, and other topics.

Part 2 is here and part 3 is here.

  Shlomi Noach — PlanetScale

I hadn’t heard the term “lurking variable” before, but I definitely know the concept. This article is a must-read for anyone troubleshooting tricky problems in production, and especially for earlier-career folks developing their skills.

  Teiva Harsanyi — The Coder Cafe

This article gives 4 strategies to better handle situations when database queries need to join across data residing in separate shards.

  Baskar Sikkayan — DZone

SRE Weekly Issue #469

A message from our sponsor, incident.io:

Speed isn’t everything. We studied 100K+ incidents to find out what actually makes for good incident management—from detection to follow-up. You can now view the recording of our latest live event to get even more info on the benchmarks, insights, and real-life examples from the report.

https://go.incident.io/events/going-beyond-mttx

I’ve shared this article before, but it’s so critical that it’s time to give it another read. MTTR is a statistically useless metric, and by using it, we will draw faulty conclusions and potentially take harmful actions. Courtney Nash does a really great job of laying out the science in an easy-to-understand way.

  Courtney Nash — Resilience in Software Foundation / The VOID

I like the analogy here: when we say people are components in our sociotechnical systems, system diagrams are like a form of cache.

  Clint Byrum

From Werner Vogels’s intro to this article:

Andy takes us through S3’s evolution from simple object store to sophisticated data platform, illustrating how customer feedback has shaped every aspect of the service. It’s a fascinating look at how we maintain simplicity even as systems scale to handle hundreds of trillions of objects.

  Andy Warfield — Amazon

Instead of a traditional Cost/Performance/Reliability trade-off, this article argues that serverless presents a trade-off of Cost, Performance, and Complexity.

  Luc van Donkersgoed

Google uses System Theoretic Process Analysis to identify problems in their systems. They found that the most effective way to spread adoption of STPA was to build their own training program.

  Garrett Holthaus — Google

So far, I’m liking this new post series from Nextdoor about their efforts to scale their datastore. Here’s the first installment, about the things they’ve tried up to now.

I’ll share the rest of the series as I work my way through them.

  Slava Markeyev — Nextdoor

Wow, I had no idea EBS volumes failed this often!

  Nick Van Wiggeren — PlanetScale

SRE Weekly Issue #468

A message from our sponsor, incident.io:

MTTx metrics fall short—learn the new industry benchmarks for measuring and improving incident management. Join us on Tuesday, March 18th to discover data-driven insights from 100K+ incidents and practical steps to enhance your response strategy.

https://go.incident.io/registration.goldcast.io/webinar/going-beyond-mttx-measuring-what-good-incident-management-looks-like

No matter how bullet-proof you build the components of your system, the only way to make nines go up is to be ready to deal with the host of surprises that take them back down.

  Clint Byrum

Here’s an example of a really great application of bloom filters, in which speed is key and a slight risk of false positives is acceptable.

  Alex Gardiner — Klaviyo
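
As a reminder of where that false-positive risk comes from, here is a small bloom filter sketch. It is not the implementation from the article; the BloomFilter class, its default sizes, and the use of seeded SHA-256 hashes are assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely never added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

An item that was added always reports True. An item that was never added usually reports False, but it can report True if other items happened to set all of its bits, and that is the trade being made for speed and small memory use.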

This fun video gives us a small glimpse into the world of traffic light controllers, and more importantly, what makes them reliable. There’s also a longer video that goes deeper into why a Raspberry Pi isn’t up to the job.

  Traffic Light Doctor

Here’s an overview of several options to scale Prometheus beyond a single instance, including a handy table of features and functionality.

  Gaurav Maheshwari

A nice guide for using incident analysis in your home lab setup, plus a write-up for an incident experienced by the author.

  Barush Mendez

A highly detailed explanation of Paxos with diagrams and a model in FizzBee.

  Lorin Hochstein

I’ve boiled my frustration down to three problems:

  1. No one agrees on what “microservice” means.
  2. Microservices conversations are abstract, with little tie-in to real business goals.
  3. Adopting microservices without changing your organisation is pointless.

  Ian Miell — Container Solutions

SRE Weekly Issue #467

A message from our sponsor, incident.io:

SEV0 is back. This fall, we’re bringing together the best minds in incident management for a day of learning, sharing, and networking in San Francisco and London. RSVP now—tickets are complimentary.

https://go.incident.io/SEV0-2025

It’s been a while since we’ve seen any updates from the LFI folks, but here’s a brand new home for the community. I’ve bought my membership.

I like this article’s measured approach to anomaly detection and other AIOps features. Will it work? With your data?

  Jacek Migdal — Quesma

A structured approach to system design includes defining the problem, scope, tenets, risks, assumptions, and architecture choices.

I like how this article follows the process it lays out by writing an example design for a distributed search engine.

  Nikunj Agarwal — DZone

A mental model to detect and prevent optimizing the wrong thing, at the wrong time, or for the wrong reasons

This is the first time I’ve seen premature optimization dissected in this way, and I really like this model.

  Alex Ewerlöf

My favorite part of this podcast episode is the discussion of the unintended consequences of automation and “humans-are-better-at/machines-are-better-at” oversimplification. The transcript is great in case you’re not able to listen.

  Shane Hastie, with guest Courtney Nash — InfoQ

What role is an AI tool going to play in your sociotechnical system? This article gives you 12 insightful questions that will help guide your approach.

  Fred Hebert — Honeycomb

As long as there’s at least one HDD ‘tape’ filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing.

And “nothing” doesn’t cause an alert. Oops!

  Chris Siebenmann
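
The trap can be modeled in a few lines. This is a sketch under the assumption that the query engine returns an empty result, rather than a single 0 sample, when an aggregation has no input series; the blurb does not name the tool, and the function names here are hypothetical.

```python
def count_series(series):
    """Aggregate like a vector query: one sample if there is input, else none."""
    if not series:
        return []  # an empty result, not [0]
    return [len(series)]


def alert_fires(result, threshold):
    """Fire if any returned sample is below the threshold.

    With no samples there is nothing to compare, so the alert stays silent.
    """
    return any(value < threshold for value in result)


def alert_fires_safe(result, threshold):
    """Same rule, but treat an empty result as a count of zero."""
    values = result or [0]
    return any(value < threshold for value in values)
```

With no filesystems mounted, alert_fires(count_series([]), 1) stays quiet, while alert_fires_safe(count_series([]), 1) fires as intended.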

SRE Weekly Issue #466

A bit of a short issue this week, as I spent most of my weekend at my child’s first FIRST Robotics Competition of the season. FRC truly is a microcosm of reliability engineering, balancing limited time and resources while trying to produce the most reliable bot possible.

A message from our sponsor, incident.io:

What does “good” incident management look like? MTTx metrics track speed, but speed alone doesn’t mean success. We analyzed 100,000+ incidents from companies of all sizes to identify benchmarks for every stage of the incident lifecycle. See how your team stacks up.

https://go.incident.io/good-incident-management-report

Just because Google, Amazon, or Facebook does it doesn’t mean you should. Here are four ‘best practices’ of the hyperscalers you have permission to ignore.

  Matt Asay — InfoWorld

An introduction to distributed transactions using the Saga pattern, including pros and cons and two approaches for implementing sagas.

  Sid — Scalable Thread
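
To make the core idea concrete, here is a minimal sketch of a saga driven by a single coordinator, a style commonly called orchestration. The blurb does not say which two approaches the article covers, so treat this as one possible shape; run_saga and the step names in the usage note are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, the compensations for the steps that already
    completed run in reverse order, and the original error is re-raised.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()  # best-effort rollback of earlier local transactions
            raise
        completed.append(compensate)
```

For example, given steps to reserve stock, charge a card, and ship an order, a failure while charging the card would run only the "release stock" compensation, since that is the only step that completed. A real system would also need to persist saga state and handle a compensation that itself fails, which this sketch leaves out.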

Here’s an argument against real-world “war rooms” for incident response, including a great incident story as an example.

I can’t imagine doing that kind of multi-window parallel investigation stuff on a teeny little laptop screen with people right next to me on either side

  rachelbythebay

This one includes a list of responsibilities a lead incident responder has and another list of things they should delegate.

Incident lead isn’t an extra job that you do “on top of” engineering. It’s the main job.

  r/devoopseng — Reddit r/sre

Scaling Elasticsearch requires balancing sharding, query performance, and memory tuning for optimal efficiency in high-traffic, real-time applications.

  Vivek Kumar — DZone

A production of Tinker Tinker Tinker, LLC