This is really neat! They’ve developed a new approach to search that indexes 3-character “trigrams” rather than tokenized words, making it especially well-suited to code search. It converts regular expressions to trigram searches behind the scenes.
Dmitry Gruzd — GitLab
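To make the trigram idea concrete, here is a minimal sketch of a trigram index for literal substring search. It is not GitLab's implementation: the function names and the sample files are invented for illustration, and translating a regular expression into a trigram query is more involved and not shown.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document names that contain it."""
    index = defaultdict(set)
    for name, body in docs.items():
        for gram in trigrams(body):
            index[gram].add(name)
    return index

def search(index, docs, needle):
    """Find documents that contain needle as a literal substring."""
    grams = trigrams(needle)
    if not grams:
        # Needle is shorter than 3 characters, so the index cannot narrow
        # the search and every document is a candidate.
        candidates = set(docs)
    else:
        # A matching document must contain every trigram of the needle.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all trigrams is necessary but not sufficient, so confirm
    # each candidate with a real substring check.
    return sorted(name for name in candidates if needle in docs[name])

# Hypothetical sample corpus.
docs = {
    "a.py": "def parse_config(path):",
    "b.py": "def parse_args(argv):",
    "c.py": "config = load()",
}
index = build_index(docs)
print(search(index, docs, "parse_"))  # ['a.py', 'b.py']
print(search(index, docs, "config"))  # ['a.py', 'c.py']
```

Because the index works on raw characters, punctuation and underscores are searchable, which is the property that makes this style of index attractive for code.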
This article about LLMs is by an author regularly featured here in the newsletter. Strictly speaking, it’s not SRE-related, but I got a lot out of it, so I’m including it anyway.
Lorin Hochstein
This one explains the difference between a soft and hard dependency, why it matters, and how to use this information to improve reliability. I like the section on soft dependencies evolving into hard dependencies when you’re not looking.
Teiva Harsanyi — The Coder Cafe
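As a rough illustration of the distinction (not taken from the article), the sketch below treats an account lookup as a hard dependency and a recommendations lookup as a soft one. The function and parameter names are hypothetical.

```python
def render_page(user_id, load_account, load_recommendations):
    """Build a page from one hard and one soft dependency.

    Both loaders are hypothetical callables standing in for remote services.
    """
    # Hard dependency: the page is meaningless without the account,
    # so any failure is allowed to propagate to the caller.
    account = load_account(user_id)

    # Soft dependency: the page still works without recommendations,
    # so a failure degrades the response instead of failing it.
    try:
        recommendations = load_recommendations(user_id)
    except Exception:
        recommendations = []

    return {"user": account["name"], "recommendations": recommendations}


def broken_service(user_id):
    """Simulate a dependency that is currently down."""
    raise ConnectionError("service unavailable")


# The soft dependency is down, yet the page is still served.
page = render_page(7, lambda uid: {"name": "ada"}, broken_service)
print(page)  # {'user': 'ada', 'recommendations': []}
```

A soft dependency quietly becomes a hard one the moment someone removes the `try`/`except`, or adds code that assumes `recommendations` is never empty.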
In this post, we’ll walk through how we’re splitting apart our shared database into independently owned instances. We’ll explain how we defined the right boundaries, minimized risk during migrations, and built the tooling to make the process safe and scalable.
Fabiana Scala and Tali Gutman — Datadog
At some point, the external dependencies which our systems rely on become so tightly coupled, large, and fundamental that should those foundations inevitably fail, that blame can actually go down in response to an incident.
This thought-provoking article explores why we’re more tolerant of outages from large tech companies like Google Cloud or Salesforce, and what this means for how we think about reliability engineering and incident response.
Will Gallego
This practical guide shows how to use AWS Fault Injection Service (FIS) to perform chaos engineering experiments on self-managed Cassandra clusters. It walks through setting up experiments to test node failure scenarios and validate that applications can properly handle database outages through connection pooling and retry mechanisms.
Hans Nesbitt and Lwanga Phillip — AWS
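The kind of client-side retry logic such an experiment is meant to validate might look like the sketch below. It assumes a generic operation that raises a hypothetical `NodeUnavailable` error. It does not use the AWS FIS API or a real Cassandra driver, and production drivers typically ship their own retry policies.

```python
import random
import time


class NodeUnavailable(Exception):
    """Hypothetical error raised when a database node cannot be reached."""


def with_retries(operation, attempts=4, base_delay=0.05, sleep=time.sleep):
    """Call operation, retrying on NodeUnavailable with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except NodeUnavailable:
            if attempt == attempts - 1:
                # Out of attempts: surface the failure to the caller.
                raise
            # Exponential backoff with full jitter, so that many clients
            # do not retry against a recovering node at the same instant.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def make_flaky_query(failures):
    """Return a query that fails the given number of times, then succeeds."""
    state = {"remaining": failures}

    def query():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise NodeUnavailable("node down")
        return "row"

    return query


# Simulate a node that recovers after two failed attempts.
print(with_retries(make_flaky_query(2)))  # row
```

Injecting a real node failure then checks whether the retry budget and delays are large enough to ride out the outage the experiment creates.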
Klaviyo shares how they built an automated recovery system to handle billing usage tracking failures. The system uses S3 for data storage and SQS for message queuing to ensure that missed usage events are automatically recovered, eliminating manual intervention and reducing customer confusion.
Kaavya Antony — Klaviyo