SRE Weekly Issue #460

A message from our sponsor, incident.io:

See how Netflix scaled their incident management with incident.io. By leveraging intuitive tools like Catalog and Workflows, they built a streamlined, scalable process that empowers teams to handle incidents with ease and consistency—even at Netflix’s scale.

https://incident.io/customers/netflix

So I bombed an incident review this week. More specifically, the facilitating.

I love how candid this article is. This kind of story is invaluable to level up our own retrospective facilitation skills.

  Will Gallego

It turns out that Google Cloud has a distributed tracing offering, and here’s an example of how to set it up.

  Punit Sethi

This article explains how 8 popular database systems use synchronized clocks. The systems covered include Spanner, DynamoDB, CockroachDB, and others.

  Murat

This article introduces the concept of a hot shard in a distributed system and outlines several strategies for alleviating it.

  Sid
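
One widely used way to relieve a hot shard is key salting: writes to a hot key are spread over several suffixed sub-keys, and reads gather all of them. The blurb does not say which strategies the article covers, so treat this as a generic sketch and not the author's method. The names `shard_for`, `salted_write_key`, and `salted_read_keys` and the `#` suffix format are all hypothetical.

```python
import hashlib
import random


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (identical across processes)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def salted_write_key(key: str, fanout: int, rng: random.Random) -> str:
    """Pick one of `fanout` sub-keys at random so writes to a hot key spread out."""
    return f"{key}#{rng.randrange(fanout)}"


def salted_read_keys(key: str, fanout: int) -> list[str]:
    """A read must fetch every sub-key and merge the results."""
    return [f"{key}#{i}" for i in range(fanout)]


if __name__ == "__main__":
    rng = random.Random()
    write_key = salted_write_key("celebrity-user", fanout=8, rng=rng)
    print(write_key, "-> shard", shard_for(write_key, num_shards=16))
    print(sorted({shard_for(k, 16) for k in salted_read_keys("celebrity-user", 8)}))
```

The trade-off is that reads become more expensive, since each one fans out to every sub-key, so salting is usually reserved for keys already known to be hot.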

Leap seconds can be really dangerous for IT systems! This article explains how the author eased their infrastructure through a leap second by smearing its effect across the preceding day.

  rachelbythebay
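
To make the smearing idea concrete, here is a minimal sketch of a linear smear, assuming a positive leap second absorbed evenly over an 86,400-second window. The linear shape, the window length, and the function names are assumptions for illustration, not details taken from the article.

```python
def smear_offset(elapsed_s: float, window_s: float = 86_400.0, leap_s: float = 1.0) -> float:
    """Portion of the leap second already absorbed, elapsed_s seconds into the window.

    Clamped so that it is 0 before the window starts and leap_s after it ends.
    """
    fraction = min(max(elapsed_s / window_s, 0.0), 1.0)
    return leap_s * fraction


def smeared_time(unsmeared_s: float, window_start_s: float) -> float:
    """Smeared clock reading for a clock that counts the leap second as an ordinary second.

    The smeared clock gradually falls behind the unsmeared one, so it never
    has to repeat or skip a second when the leap occurs.
    """
    return unsmeared_s - smear_offset(unsmeared_s - window_start_s)


if __name__ == "__main__":
    for elapsed in (0.0, 21_600.0, 43_200.0, 86_400.0):
        print(elapsed, smear_offset(elapsed))
```

Under these assumptions, each smeared second is stretched by roughly 11.6 microseconds, which most software tolerates far better than a repeated or 61-second minute.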

This article series revisits the underpinnings of the shift toward microservices, with a critical eye. My favorite bit is the analogy for microservice complexity in part 3.

  Uwe Friedrichsen

Catchpoint is back with their seventh annual SRE report, and you can download the PDF directly without having to register.

  Catchpoint

There are some real gems in here, including my favorite, death by yes.

Updated: January 19, 2025 — 9:58 pm
A production of Tinker Tinker Tinker, LLC