Giving customers estimated incident resolution times puts unneeded stress on responders for very little gain.
Robert Ross — FireHydrant
I can’t tell you how many times I’ve found myself lost in thought, wondering how something like EBS works. While this isn’t an architecture overview, it does contain a bunch of juicy tidbits. I especially like the bit about the value of a “full stack engineer”.
Marc Olson — All Things Distributed
This article explains how to use eBPF to gather observability data, including an example eBPF program and instructions on how to run it.
Kranthi Kiran Erusu — DZone
Netflix uses multiple kinds of data stores. Developers found it difficult to manage the differences between them, so Netflix built an abstraction layer.
Our goal was to build a versatile and efficient data storage solution that could handle a wide variety of use cases, ranging from the simplest hashmaps to more complex data structures, all while ensuring high availability, tunable consistency, and low latency.
Vidhya Arvind, Rajasekhar Ummadisetty, Joey Lynch, and Vinay Chella — Netflix
This post looks at the challenges of predicting capacity in a global CDN, including dealing with uncertainties in customer growth, traffic routing, hardware failure, and more.
Curt Robords — Cloudflare
GitHub tells us about the tools they use to improve reliability and performance, including Scientist and Flipper.
Nick Hengeveld — GitHub
If you’re heavily action-item-oriented like I used to be, this is a great read to get you thinking down a different path.
My coworker wrote this awesome script to update our various @team-oncall aliases in Slack automatically, following our PagerDuty on-call schedule. This one thing has already saved us so much in the way of toil, frustration, and missed notifications!
Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.