In this final installment of the Scaling Nextdoor’s Datastores blog series, we detail how the Core-Services team at Nextdoor solved cache consistency challenges as part of a holistic approach to improve our database and cache scalability and usability.
I really enjoyed this whole series. Thanks, Nextdoor folks!
Slava Markeyev — Nextdoor
These folks analyzed a non-production incident like it was production, including retrospective analysis and lessons learned. Best part: they share the juicy details with us!
Joe Mckevitt — UptimeLabs
This one goes over several different models you can use to implement on-call compensation, with pros and cons for each.
Constant Fischer — PagerDuty
This article shows that MySQL’s CATS algorithm offers only a small performance gain over FIFO once deadlock logging interference is removed.
My jaw involuntarily opened when I saw the graph after they commented out the logging print statements.
Bin Wang — DZone
In this article, I’ll walk you through how we implemented chaos engineering across our stack using Chaos Toolkit, Chaos Monkey, and Istio — with hands-on examples for Java and Node.js. If you’re exploring ways to strengthen system resilience, this guide is packed with practical insights you can apply today.
The author does not appear to have a tie to Istio. This article has a ton of code snippets to help you get started.
Prabhu Chinnasamy — DZone
In this blog, we’ll look at three important facts about serverless reliability that teams often overlook. We’ll explain what they are, what the risks are of not addressing them, and how you can make your serverless applications more fault-tolerant.
- Serverless architectures don’t guarantee reliability.
- You do have control over serverless reliability.
- Serverless reliability practices can benefit all platforms, not just serverless platforms.
Andre Newman — Gremlin
This Golang debugging story is a really satisfying read.
The heap profiles were very effective at telling us the allocation sites of live objects, but provided no insights into why specific objects were being retained.
Ella Chao — WarpStream
Zoom had an outage this week when its domain zoom.us was temporarily blocked at the TLD level due to a miscommunication between its registrar and the TLD.
Zoom