In this part of the Scaling Nextdoor’s Datastores blog series, we will see how the Core-Services team at Nextdoor keeps its cache consistent with database updates and avoids stale writes to the cache.
Ronak Shah — Nextdoor
Okay, if we’re not supposed to use MTTR, what metrics in incident response are better?
Chris Evans — incident.io
This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.
This reminds me of the Fallacies of Distributed Computing, and it’s equally important to internalize. Disk I/O isn’t guaranteed.
Phil Eaton
Here’s a great example of how we can learn a ton from near misses. In this airplane incident, a slight change in the normal takeoff sequence resulted in missing a critical step. Even though it was only a near miss, the aviation industry instituted changes to make this kind of problem less likely.
Mentour Pilot — YouTube
In this second and final post of this little blog series, we will discuss the redundancy fallacy and the third type of coupling we need to consider in the context of remote communication: temporal coupling.
Uwe Friedrichsen
All of our systems have embedded models of the world. What happens when these models are wrong?
Lorin Hochstein
This article answers this question:
“If we had to choose just three things to sustain a resilient, healthy reliability culture, what would they be?”
with these three things:
- Know what matters to your users, and make it really visible
- Create psychological safety around failure
- Let incidents update your mental models
Busra Koken
Execs intruding in incidents can have a disruptive effect, which this article acknowledges with specific examples. It goes on to list some concrete and useful things execs can do to support incident response.
By the way, massive props to the Uptime Labs folks. They created an RSS feed for their blog at my request with a super-fast turnaround. Incredible!
Hamed Silatani — Uptime Labs