SRE Weekly Issue #453

A message from our sponsor, FireHydrant:

Why migrate from PagerDuty? Empower team-level ownership, reduce costs, decouple alerts from incidents, automate incidents end-to-end…to name a few. Join the growing list of companies that have made the switch. (p.s. our Signals migrator makes it simple)

https://firehydrant.com/migrate/from-pagerduty/

It’s a case of cascading failure, but with an interesting twist: their system was designed to handle floods but the safety mechanism was left unconfigured.

   Jamie Herre, Tom Walwyn, Christian ndres, Gabriele Viglianisi, Mik Kocikowski, and Rian van der Merwe — Cloudflare

Lorin takes apart the Cloudflare write-up with style, including a really insightful section on safety mechanisms in complex systems.

  Lorin Hochstein

Meta wanted to log details about the encrypted communications in their systems to help track key use, outdated algorithms, and the like. It’s a ton of telemetry, so they did smart sampling (which they call aggregation):

During the aggregation, a “count” is maintained for every unique event. When it comes time to flush, this count is exported along with the log to convey how often that particular event took place.

  Hussain Humadi, Sasha Frolov, Rafael Misoczki, Dong Wu — Meta

A primer on using Golang’s profiling tools including CPU profiling, memory profiling, goroutine leak analysis, and execution tracing.

  Gaurav Maheshwari — Oodle

A thought-provoking piece of automation, friction, and adaptive capacity. I especially enjoyed the section on decompensation.

  Fred Hebert

With various tools for different kinds of telemetry, these folks needed to up their game to be able to fully understand what happened in a customer request. They also needed a custom sampling strategy to make sure they didn’t miss anything important.

  Martin Fahy — Klaviyo

we’ll be looking at how Ably’s platform achieves scalability, and how, as a result, there’s no effective ceiling on the scale of applications that can be supported.

  Paddy Byers — Ably

Airbnb built a system for tracking and analyzing user actions to aid in personalization. Their system uses Flink and Kafka to handle over a million events per second.

  Kidai Kwon — Airbnb

Updated: December 1, 2024 — 9:09 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme