SRE Weekly Issue #453

Cloudflare incident on November 14, 2024, resulting in lost logs

It’s a case of cascading failure, but with an interesting twist: their system was designed to handle floods but the safety mechanism was left unconfigured.

Jamie Herre, Tom Walwyn, Christian Endres, Gabriele Viglianisi, Mik Kocikowski, and Rian van der Merwe — Cloudflare

Quick takes on the latest Cloudflare public incident write-up

Lorin takes apart the Cloudflare write-up with style, including a really insightful section on safety mechanisms in complex systems.

Lorin Hochstein

How Meta built large-scale cryptographic monitoring

Meta wanted to log details about the encrypted communications in their systems to help track key use, outdated algorithms, and the like. It’s a ton of telemetry, so they did smart sampling (which they call aggregation):

During the aggregation, a “count” is maintained for every unique event. When it comes time to flush, this count is exported along with the log to convey how often that particular event took place.

Hussain Humadi, Sasha Frolov, Rafael Misoczki, Dong Wu — Meta
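
To make the quoted aggregation idea concrete, here is a minimal sketch of count-based aggregation. It is not Meta's implementation: the `EventAggregator` class, its method names, and the event fields (algorithm, operation, key name) are hypothetical and chosen only for illustration.

```python
from collections import Counter


class EventAggregator:
    """Hypothetical in-memory aggregator: one counter per unique event."""

    def __init__(self):
        self._counts = Counter()

    def record(self, event):
        # `event` must be hashable; here it is a tuple of assumed fields.
        self._counts[event] += 1

    def flush(self):
        # Export each unique event once, with how often it occurred,
        # then reset the buffer for the next aggregation window.
        rows = [{"event": e, "count": c} for e, c in self._counts.items()]
        self._counts.clear()
        return rows


agg = EventAggregator()
for _ in range(3):
    agg.record(("AES-256-GCM", "encrypt", "key-a"))
agg.record(("RSA-2048", "sign", "key-b"))

rows = agg.flush()
for row in rows:
    print(row)
```

Four recorded events become two exported rows, so the volume of log lines scales with the number of distinct events rather than the number of operations.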

Go Profiling in Production

A primer on using Golang’s profiling tools including CPU profiling, memory profiling, goroutine leak analysis, and execution tracing.

Gaurav Maheshwari — Oodle

Local Optimizations Don’t Lead to Global Optimums

A thought-provoking piece on automation, friction, and adaptive capacity. I especially enjoyed the section on decompensation.

Fred Hebert

Understanding Timings in Distributed Systems

With various tools for different kinds of telemetry, these folks needed to up their game to be able to fully understand what happened in a customer request. They also needed a custom sampling strategy to make sure they didn’t miss anything important.

Martin Fahy — Klaviyo

Ably’s four pillars: no scale ceiling

we’ll be looking at how Ably’s platform achieves scalability, and how, as a result, there’s no effective ceiling on the scale of applications that can be supported.

Paddy Byers — Ably

Building a User Signals Platform at Airbnb

Airbnb built a system for tracking and analyzing user actions to aid in personalization. Their system uses Flink and Kafka to handle over a million events per second.

Kidai Kwon — Airbnb
