SRE Weekly Issue #472

A message from our sponsor, incident.io:

We’ve just raised $62M at incident.io to build AI agents that resolve incidents with you. See how we’re pioneering a new era of incident management.

https://go.incident.io/blog/incident.io-raises-62m

In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache.

  Ronak Shah — Nextdoor

Okay, if we’re not supposed to use MTTR, what metrics in incident response are better?

  Chris Evans — incident.io

  This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.

This reminds me of the Fallacies of Distributed Computing, and it’s equally important to internalize. Disk I/O isn’t guaranteed.

  Phil Eaton

Here’s a great example of how we can learn a ton from near misses. In this airplane incident, a slight change in the normal takeoff sequence resulted in a critical step being missed. Even though it was only a near miss, the aviation industry still instituted changes to make this kind of problem less likely.

  Mentour Pilot — YouTube

In this second and final post of this little blog series, we will discuss the redundancy fallacy and the third type of coupling we need to consider in the context of remote communication: temporal coupling.

  Uwe Friedrichsen

All of our systems have embedded models of the world. What happens when these models are wrong?

  Lorin Hochstein

This article answers this question:

“If we had to choose just three things to sustain a resilient, healthy reliability culture, what would they be?”

with these three things:

  1. Know what matters to your users, and make it really visible
  2. Create psychological safety around failure
  3. Let incidents update your mental models

  Busra Koken

Execs intruding in incidents can have a disruptive effect, which this article acknowledges with specific examples. It goes on to list some concrete and useful things execs can do to support incident response.

By the way, massive props to the Uptime Labs folks. They created an RSS feed for their blog at my request with a super-fast turnaround. Incredible!

  Hamed Silatani — Uptime Labs

Updated: April 13, 2025 — 9:52 pm
A production of Tinker Tinker Tinker, LLC