It’s a case of cascading failure, but with an interesting twist: their system was designed to handle floods but the safety mechanism was left unconfigured.
Jamie Herre, Tom Walwyn, Christian Endres, Gabriele Viglianisi, Mik Kocikowski, and Rian van der Merwe — Cloudflare
Lorin takes apart the Cloudflare write-up with style, including a really insightful section on safety mechanisms in complex systems.
Lorin Hochstein
Meta wanted to log details about the encrypted communications in their systems to help track key use, outdated algorithms, and the like. It’s a ton of telemetry, so they did smart sampling (which they call aggregation):
During the aggregation, a “count” is maintained for every unique event. When it comes time to flush, this count is exported along with the log to convey how often that particular event took place.
Hussain Humadi, Sasha Frolov, Rafael Misoczki, Dong Wu — Meta
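To make the quoted aggregation idea concrete, here is a minimal sketch of count-based log aggregation. It is not Meta's implementation: the `EventAggregator` class, its method names, and the event fields (algorithm, key ID) are all hypothetical, chosen only to show identical events collapsing into one log row with a count at flush time.

```python
from collections import Counter


class EventAggregator:
    """Hypothetical in-memory aggregator: counts identical events between flushes."""

    def __init__(self):
        # Maps each unique event (a hashable tuple of fields) to its count.
        self._counts = Counter()

    def record(self, event):
        # Identical events are not logged individually; only the count grows.
        self._counts[event] += 1

    def flush(self):
        # Export one row per unique event, carrying how often it occurred,
        # then reset the counts for the next aggregation window.
        rows = [{"event": e, "count": c} for e, c in self._counts.items()]
        self._counts.clear()
        return rows


agg = EventAggregator()
for _ in range(3):
    agg.record(("AES-256-GCM", "key-1"))  # assumed fields: (algorithm, key ID)
agg.record(("RSA-2048", "key-2"))
rows = agg.flush()
```

In this sketch, four recorded events produce two exported rows, which is where the reduction in telemetry volume comes from.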
A primer on using Golang’s profiling tools including CPU profiling, memory profiling, goroutine leak analysis, and execution tracing.
Gaurav Maheshwari — Oodle
A thought-provoking piece on automation, friction, and adaptive capacity. I especially enjoyed the section on decompensation.
Fred Hebert
With various tools for different kinds of telemetry, these folks needed to up their game to be able to fully understand what happened in a customer request. They also needed a custom sampling strategy to make sure they didn’t miss anything important.
Martin Fahy — Klaviyo
We’ll be looking at how Ably’s platform achieves scalability, and how, as a result, there’s no effective ceiling on the scale of applications that can be supported.
Paddy Byers — Ably
Airbnb built a system for tracking and analyzing user actions to aid in personalization. Their system uses Flink and Kafka to handle over a million events per second.
Kidai Kwon — Airbnb