SRE Weekly Issue #472

In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache.

Ronak Shah — Nextdoor

Going beyond MTTx and measuring “good” incident management

Okay, if we’re not supposed to use MTTR, what metrics in incident response are better?

Chris Evans — incident.io

This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.

Things that go wrong with disk IO

This reminds me of the Fallacies of Distributed Computing, and it’s equally important to internalize. Disk I/O isn’t guaranteed.

Phil Eaton

Critical Step MISSED! | What Happened on Jet2 Flight 2152?!

Here’s a great example of how we can learn a ton from near misses. In this airplane incident, a slight change in the normal takeoff sequence resulted in a critical step being missed. Even though it was only a near miss, the aviation industry still instituted changes to make this kind of problem less likely.

Mentour Pilot — YouTube

(Un)coupling in distributed systems – Part 2

In this second and final post of this little blog series, we will discuss the redundancy fallacy and the third type of coupling we need to consider in the context of remote communication: temporal coupling.

Uwe Friedrichsen

Model error

All of our systems have embedded models of the world. What happens when these models are wrong?

Lorin Hochstein

Three Guiding Lights on Sustaining Resilience

This article answers this question:

“If we had to choose just three things to sustain a resilient, healthy reliability culture, what would they be?”

with these three things:

Know what matters to your users, and make it really visible

Create psychological safety around failure

Let incidents update your mental models

Busra Koken

Hot Take: I Want Execs Closer to Incidents, Not Farther

Execs intruding in incidents can have a disruptive effect, which this article acknowledges with specific examples. It goes on to list some concrete and useful things execs can do to support incident response.

By the way, massive props to the Uptime Labs folks. They created an RSS feed for their blog at my request with a super-fast turnaround. Incredible!

Hamed Silatani — Uptime Labs
