SRE Weekly Issue #473

A message from our sponsor, incident.io:

We’ve just raised $62M at incident.io to build AI agents that resolve incidents with you. See how we’re pioneering a new era of incident management.

https://go.incident.io/blog/incident.io-raises-62m

In this final installment of the Scaling Nextdoor’s Datastores blog series, we detail how the Core-Services team at Nextdoor solved cache consistency challenges as part of a holistic approach to improve our database and cache scalability and usability.

I really enjoyed this whole series. Thanks, Nextdoor folks!

  Slava Markeyev — Nextdoor

These folks analyzed a non-production incident like it was production, including retrospective analysis and lessons learned. Best part: they share the juicy details with us!

  Joe Mckevitt — UptimeLabs

This one goes over several different models you can use to implement on-call compensation, with pros and cons for each.

  Constant Fischer — PagerDuty

This article shows that MySQL’s CATS algorithm offers only a small performance gain over FIFO once deadlock logging interference is removed.

My jaw involuntarily opened when I saw the graph after they commented out the logging print statements.

   Bin Wang — DZone

In this article, I’ll walk you through how we implemented chaos engineering across our stack using Chaos Toolkit, Chaos Monkey, and Istio — with hands-on examples for Java and Node.js. If you’re exploring ways to strengthen system resilience, this guide is packed with practical insights you can apply today.

The author does not appear to have a tie to Istio. This article has a ton of code snippets to help you get started.

   Prabhu Chinnasamy — DZone

In this blog, we’ll look at three important facts about serverless reliability that teams often overlook. We’ll explain what they are, what the risks are of not addressing them, and how you can make your serverless applications more fault-tolerant.

  1. Serverless architectures don’t guarantee reliability.
  2. You do have control over serverless reliability.
  3. Serverless reliability practices can benefit all platforms, not just serverless platforms.

  Andre Newman — Gremlin

This Golang debugging story is a really satisfying read.

The heap profiles were very effective at telling us the allocation sites of live objects, but provided no insights into why specific objects were being retained.

  Ella Chao — WarpStream

Zoom had an outage this week when its domain zoom.us was temporarily blocked at the TLD level due to a miscommunication between its registrar and the TLD.

  Zoom

SRE Weekly Issue #472


In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache.

  Ronak Shah — Nextdoor

Okay, if we’re not supposed to use MTTR, what metrics in incident response are better?

  Chris Evans — incident.io

  This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.

This reminds me of the Fallacies of Distributed Computing, and it’s equally important to internalize. Disk I/O isn’t guaranteed.

  Phil Eaton
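To make the "disk I/O isn't guaranteed" point concrete, here's a minimal sketch (my own illustration, not from the article) of a write path that checks every step. Note in particular that a successful write() only hands data to the OS, and that fsync() can itself fail; on many systems a failed fsync should be treated as possible data loss, not something a retry will fix.

```python
import os

def durable_write(path, data: bytes):
    """Write and fsync, checking for errors at every step."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        n = os.write(fd, data)
        if n != len(data):
            # Partial writes are legal; callers must detect them.
            raise OSError(f"short write: {n} of {len(data)} bytes")
        # May raise OSError. Do NOT assume retrying fsync recovers
        # the data -- the failed pages may already be gone.
        os.fsync(fd)
    finally:
        os.close(fd)
```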

Here’s a great example of how we can learn a ton from near misses. In this airplane incident, a slight change in the normal takeoff sequence resulted in missing a critical step. Even though it was only a near miss, the aviation industry instituted changes to make this kind of problem less likely.

  Mentour Pilot — YouTube

In this second and final post of this little blog series, we will discuss the redundancy fallacy and the third type of coupling we need to consider in the context of remote communication: temporal coupling.

  Uwe Friedrichsen

All of our systems have embedded models of the world. What happens when these models are wrong?

  Lorin Hochstein

This article answers this question:

“If we had to choose just three things to sustain a resilient, healthy reliability culture, what would they be?”

with these three things:

  1. Know what matters to your users, and make it really visible
  2. Create Psychological Safety Around Failure
  3. Let incidents update your mental models

  Busra Koken

Execs intruding in incidents can have a disruptive effect, which this article acknowledges with specific examples. It goes on to list some concrete and useful things execs can do to support incident response.

By the way, massive props to the Uptime Labs folks. They created an RSS feed for their blog at my request with a super-fast turnaround. Incredible!

  Hamed Silatani — Uptime Labs

SRE Weekly Issue #471

A message from our sponsor, incident.io:

We’re building an AI agent that investigates incidents with you—diagnosing the problem and even fixing it. Go behind the scenes with the incident.io engineers rethinking what’s possible with AI, one ambitious idea (and bug) at a time.

https://go.incident.io/building-with-ai

The author of this one draws a line between their two interests of formal methods and resilience engineering, and I’m so here for it.

  Lorin Hochstein

In this part of the Scaling Nextdoor’s Datastores blog series, we’ll explore how the Core-Services team at Nextdoor serializes database data for caching while ensuring forward and backward compatibility between the cache and application code.

  Ronak Shah — Nextdoor

MySQL’s ALTER TABLE INPLACE has limitations and downsides, and INSTANT does too, as explained in this article.

  Shlomi Noach — Planetscale

If you have multiple different types of work in your system, a queue per type of work may be a good choice.

Bonus(?): includes a bathroom-based analogy.

  Marc Brooker
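As a rough sketch of the idea (my own, with hypothetical work types, not Marc's example): giving each type of work its own queue and worker means a backlog of slow jobs can't head-of-line-block the fast ones.

```python
import queue
import threading

# One queue per work type, so slow "reports" jobs can't delay
# fast "emails" jobs waiting behind them in a shared queue.
queues = {"emails": queue.Queue(), "reports": queue.Queue()}
results = []

def worker(kind, q):
    while True:
        job = q.get()
        if job is None:  # sentinel: shut this worker down
            return
        results.append((kind, job))  # stand-in for real processing

threads = []
for kind, q in queues.items():
    t = threading.Thread(target=worker, args=(kind, q))
    t.start()
    threads.append(t)

queues["emails"].put("welcome-mail")
queues["reports"].put("monthly-report")
for q in queues.values():
    q.put(None)
for t in threads:
    t.join()
```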

One Lambda function per URL path? Or a monolithic function that handles multiple paths? There are benefits and drawbacks to each.

  Yan Cui
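For the monolithic side of that tradeoff, a single handler typically dispatches on the request path itself. This sketch (hypothetical routes and handlers, assuming an API Gateway-style event with `httpMethod` and `path` fields) shows the shape:

```python
# Hypothetical handlers for two URL paths, served by one function.
def get_user(event):
    return {"statusCode": 200, "body": "user"}

def list_orders(event):
    return {"statusCode": 200, "body": "orders"}

ROUTES = {
    ("GET", "/user"): get_user,
    ("GET", "/orders"): list_orders,
}

def handler(event, context=None):
    """Monolithic entry point: route on (method, path)."""
    route = ROUTES.get((event.get("httpMethod"), event.get("path")))
    if route is None:
        return {"statusCode": 404, "body": "not found"}
    return route(event)
```

One deployment unit and shared cold starts, at the cost of coarser permissions and scaling, which is exactly the tradeoff the article weighs.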

Published on April 1.

The truth is, many incidents move faster when there’s executive oversight — a sense of urgency, pressure, and someone repeatedly asking, “What’s the ETA?”

  Chris Evans — incident.io

  This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.

I’m seeing a lot of echoes of Bainbridge’s Ironies of Automation in this article about AIOps and AI tooling. If AI handles most coding and incidents, how will humans handle the outliers?

  Hamed Silatani — Uptime Labs

I wasn’t able to make it, so I really appreciate this recap. Sounds like SRECon was, unsurprisingly, heavily focused on AI this time around.

  Niall Murphy

SRE Weekly Issue #470

A message from our sponsor, incident.io:

Intercom migrated hundreds of engineers from PagerDuty and Atlassian Status Page to incident.io in just weeks, improving resolution times, simplifying incident management, and delivering a better customer support experience. Watch the video case study.

https://go.incident.io/customers/intercom

An SRE thinks about the meaning of “sociotechnical”:

From an SRE perspective, it means that when we’re looking at a piece of software, we can’t just factor out the human decisions that happen both in its operation and usage, but also in its development.

  Clint Byrum

This one is about the difficulties they had with database read replicas that led to developers mostly just sending reads to the primary. They came up with a pretty neat solution to automatically send read queries to the replica when possible.

In case you missed it, here’s part 1.

  Tushar Singla — Nextdoor

This well-thought-out article starts with a solid critique of Five Whys, illustrated with example scenarios. The author then explains why they prefer open-ended questions.

  Hamed Silatani

Spurred by a conversation with engineers, the author of this article explains what retries, backoff, and jitter can fix, and more importantly, when they won’t help.

  Tejas Ghadge — The New Stack
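For reference, the standard combination of the three is capped exponential backoff with "full jitter": each delay is drawn uniformly from zero up to the (capped) exponential bound, so a crowd of retrying clients spreads out instead of hammering the server in lockstep. A minimal sketch (my own, not from the article):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0):
    """Retry `operation`, sleeping with capped exponential backoff
    plus full jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)

# Example: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(retry_with_backoff(flaky))  # prints "ok" after two retried failures
```

As the article stresses, this only helps with transient failures; retrying a request that fails deterministically, or one that isn't idempotent, makes things worse.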

This is a juicy one, involving a routine credential roll gone bad, resulting in an outage in Cloudflare’s R2 service.

  Phillip Jones — Cloudflare

In this series of posts, we illustrate design considerations for a database system throttler, whose purpose is to keep the database system healthy overall. We discuss choice of metrics, granularity, behavior, impact, prioritization, and other topics.

Part 2 is here and part 3 is here.

  Shlomi Noach — Planetscale

I hadn’t heard the term “lurking variable” before, but I definitely know the concept. This article is a must-read for anyone troubleshooting tricky problems in production, and especially for earlier-career folks developing their skills.

  Teiva Harsanyi — The Coder Cafe

This article gives 4 strategies to better handle situations when database queries need to join across data residing in separate shards.

   Baskar Sikkayan — DZone

SRE Weekly Issue #469

A message from our sponsor, incident.io:

Speed isn’t everything. We studied 100K+ incidents to find out what actually makes for good incident management—from detection to follow-up. You can now view the recording of our latest live event to get even more info on the benchmarks, insights, and real-life examples from the report.

https://go.incident.io/events/going-beyond-mttx

I’ve shared this article before, but it’s so critical that it’s time to give it another read. MTTR is a statistically useless metric, and by using it, we will draw faulty conclusions and potentially take harmful actions. Courtney Nash does a really great job of laying out the science in an easy-to-understand way.

  Courtney Nash — Resilience in Software Foundation / The VOID

I like the analogy here: when we say people are components in our sociotechnical systems, system diagrams are like a form of cache.

  Clint Byrum

From Werner Vogels’s intro to this article:

Andy takes us through S3’s evolution from simple object store to sophisticated data platform, illustrating how customer feedback has shaped every aspect of the service. It’s a fascinating look at how we maintain simplicity even as systems scale to handle hundreds of trillions of objects.

  Andy Warfield — Amazon

Instead of a traditional Cost/Performance/Reliability trade-off, this article argues that serverless presents a tradeoff of Cost, Performance, and Complexity.

  Luc van Donkersgoed

Google uses System Theoretic Process Analysis to identify problems in their systems. They found that the most effective way to spread adoption of STPA was to build their own training program.

  Garrett Holthaus — Google

So far, I’m liking this new post series from Nextdoor about their efforts to scale their datastore. Here’s the first installment, about the things they’ve tried up to now.

I’ll share the rest of the series as I work my way through them.

  Slava Markeyev — Nextdoor

Wow, I had no idea EBS volumes failed this often!

  Nick Van Wiggeren — PlanetScale

A production of Tinker Tinker Tinker, LLC