SRE Weekly Issue #459

A message from our sponsor, incident.io:

Effective incident management demands coordination and collaboration to minimize disruptions. This guide by incident.io covers the full incident lifecycle—from preparation to improvement—emphasizing teamwork beyond engineering. By engineers, for engineers.

https://incident.io/guide

In a microservices environment, testing user journeys that span multiple bounded contexts requires collaboration and a clear delineation of responsibilities.

  Yan Cui

These folks migrated from Fastly to Cloudflare using Terraform. They wrote a Go program to translate their Fastly VCL configurations into an equivalent set of parameters for their Terraform module.

  hatappi1225 — Mercari

This 3-part series does a deep dive on how time and clocks work in distributed data stores. Part 2 is here and part 3 is here.

  Murat

TIL: “Unix time” (seconds since the epoch) does not include leap seconds.

  Kyle Kingsbury
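
To make that concrete, here is a minimal sketch (mine, not from the linked post) using only Python's standard library. It compares the Unix timestamps on either side of the leap second inserted at 2016-12-31 23:59:60 UTC. The variable names are just for illustration.

```python
import calendar

# A leap second (23:59:60 UTC) was inserted at the end of 2016-12-31.
before = calendar.timegm((2016, 12, 31, 23, 59, 59))  # last ordinary second of 2016
after = calendar.timegm((2017, 1, 1, 0, 0, 0))        # first second of 2017

# Unix time treats every day as exactly 86400 seconds long,
# so the two timestamps differ by 1 ...
unix_delta = after - before

# ... even though 2 real (SI) seconds elapsed between those instants.
elapsed_si_seconds = unix_delta + 1

print(before, after, unix_delta, elapsed_si_seconds)
```

Because of this, subtracting two Unix timestamps that straddle a leap second undercounts the real elapsed time by one second.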

This post argues that tech companies should avoid outages like Facebook’s in 2021 by using much more rigorous principles such as those used to build bridges. I’m not so sure about that, but it was an interesting read.

  Davi Ottenheimer

There’s a lot going on beneath the surface in a live video streaming service. Cloudflare walks us through it, including key design decisions like on-the-fly transcoding.

  Kyle Boutette and Jacob Curtis — Cloudflare

DSQL is Amazon’s new serverless PostgreSQL-compatible datastore.

Aurora DSQL is designed to remain available, durable, and strongly consistent even in the face of infrastructure failures and network partitions.

But what about the CAP theorem? Click through to find out.

  Marc Brooker

This new installment introduces the next level of resilience, which involves the ability to radically change your approach if the usual adaptation strategies fall short.

  Uwe Friedrichsen

Updated: January 12, 2025 — 9:10 pm
A production of Tinker Tinker Tinker, LLC