SRE Weekly Issue #462

A message from our sponsor, incident.io:

On-call shouldn’t feel like a nightmare. With incident.io, you get clear ownership, seamless escalations, and insights that actually help—so you can fix issues fast and get back to what matters. No chaos, just smooth operations.

https://go.incident.io/on-call-as-it-should-be

This article series asks, do you really need ACID consistency?

Well, of course ACID consistency exists – and it is a good thing that it exists. Thus, feel free to call the post title clickbait … ;)

My point here is that it should not exist as a functional requirement.

  Uwe Friedrichsen

OpenAI posted this mini report on their outage on January 30.

  OpenAI

It’s never DNS, except when it’s definitely DNS, such as in the case of this probable DNSSEC misconfiguration.

  Wilson Chua — Manila Bulletin

Do you want to prioritize availability or control?

  Teiva Harsanyi — The Coder Cafe

The amount of attention an incident gets is proportional to the severity of the incident: the greater the impact to the organization, the more attention that post-incident activities will get.

The problem is that the severity of a near-miss incident is zero, but it can have significant value for learning even still.

  Lorin Hochstein

This article urges caution in creating alerts that recommend a specific course of action when they fire. It explains why this can be dangerous and suggests alternative methods.

  Fred Hebert — Honeycomb

In this post, I will highlight some crucial Kubernetes best practices. They are from my years of experience with Kubernetes in production. Think of this as the curated “Kubernetes cheat sheet” you wish you had from Day 1.

  Engin Diri — Pulumi

Meta’s profiling system has helped them save thousands of servers’ worth of computing resources, through continuous profiling and centralized symbolization.

  Jordan Rome — Meta

SRE Weekly Issue #461

A message from our sponsor, incident.io:

Effective incident management demands coordination and collaboration to minimize disruptions. This guide by incident.io covers the full incident lifecycle—from preparation to improvement—emphasizing teamwork beyond engineering. By engineers, for engineers.

https://incident.io/guide

Written in 2020 after an AWS outage, this article analyzes dependence on third-party services and the responsibility to understand their reliability.

  Uwe Friedrichsen

When a cache expired, these folks found that their application stampeded the database with expensive queries, so they searched for a solution.

  Punit Sethi
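
One common mitigation for this kind of stampede is request coalescing: when a key expires, a single caller recomputes it while the others wait for the result. The sketch below illustrates that general idea only and is not necessarily the solution the article settled on. `SingleFlightCache`, `loader`, and `ttl_seconds` are hypothetical names chosen for illustration.

```python
import threading
import time


class SingleFlightCache:
    """Cache where only one caller recomputes an expired key; others wait."""

    def __init__(self, loader, ttl_seconds):
        self._loader = loader           # hypothetical expensive query function
        self._ttl = ttl_seconds         # hypothetical time-to-live, in seconds
        self._values = {}               # key -> (value, expires_at)
        self._key_locks = {}            # key -> lock guarding recomputation
        self._guard = threading.Lock()  # protects the two dicts above

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        with self._guard:
            entry = self._values.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        # Only one thread per key runs the loader; the rest block here and
        # then find the freshly stored value on the re-check.
        with self._lock_for(key):
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            with self._guard:
                self._values[key] = (value, time.monotonic() + self._ttl)
            return value
```

The trade-off in this sketch is latency for the waiting callers: they block for the duration of one query instead of each issuing their own.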

When a high-severity incident happens, its associated risks become salient: the incident looms large in our mind, and the fact that it just happened leads us to believe that the risk of a similar incident is very high.

  Lorin Hochstein

These folks landed on a hybrid approach using two vendors, allowing them to avoid sending their entire trace volume to an expensive observability vendor.

  Jakub Sokół — monday

Under heavy load, requests are handled in LIFO order to maximize the chance of successfully completing fresh requests.

LIFO = Last In, First Out

  Teiva Harsanyi
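
The LIFO-under-load idea can be sketched as a queue that behaves as FIFO normally and switches to LIFO once its backlog grows. This is a rough illustration, not the article's implementation. `AdaptiveQueue` and `lifo_threshold` are hypothetical, and using backlog length as the overload signal is an assumption.

```python
from collections import deque


class AdaptiveQueue:
    """FIFO under normal load, LIFO once the backlog passes a threshold."""

    def __init__(self, lifo_threshold):
        self._items = deque()
        self._lifo_threshold = lifo_threshold  # hypothetical tuning knob

    def put(self, request):
        self._items.append(request)

    def get(self):
        if not self._items:
            raise IndexError("queue is empty")
        if len(self._items) > self._lifo_threshold:
            # Overloaded: serve the newest request, whose client is the
            # least likely to have timed out already.
            return self._items.pop()
        # Normal load: serve the oldest request for fairness.
        return self._items.popleft()
```

With a threshold of 2 and requests 1 through 4 queued, this sketch serves them in the order 4, 3, 1, 2: newest first while overloaded, then oldest first once the backlog is back under the threshold.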

More than just a simple feature comparison, this article also presents two use cases and analyzes which tool is best in each case.

  Josson Paul Kalapparambath — DZone

These folks explain why they use Go for everything: application code, infrastructure as code, tooling, and even as a wrapper around Helm charts for Kubernetes.

  Akhilesh Krishnan — Oodle AI

SRE Weekly Issue #460

A message from our sponsor, incident.io:

See how Netflix scaled their incident management with incident.io. By leveraging intuitive tools like Catalog and Workflows, they built a streamlined, scalable process that empowers teams to handle incidents with ease and consistency—even at Netflix’s scale.

https://incident.io/customers/netflix

So I bombed an incident review this week. More specifically, the facilitating.

I love how candid this article is. This kind of story is invaluable to level up our own retrospective facilitation skills.

  Will Gallego

It turns out that Google Cloud has a distributed tracing offering, and here’s an example of how to set it up.

  Punit Sethi

This article explains how 8 popular database systems use synchronized clocks. The systems covered include Spanner, DynamoDB, CockroachDB, and others.

  Murat

This article introduces the concept of a hot shard in a distributed system and outlines several strategies for alleviating it.

  Sid

Leap seconds can be really dangerous for IT systems! This article explains how the author eased their infrastructure through a leap second by smearing its effect across the preceding day.

  rachelbythebay
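
To make the smearing idea concrete, here is a minimal sketch of a linear smear, assuming the extra second is spread evenly over the 86,401 SI seconds that end with the leap second. The linear shape, the window length, and the function names are assumptions for illustration, and the article's actual setup may differ.

```python
SMEAR_WINDOW = 86_401.0  # SI seconds in the smeared day, leap second included


def absorbed_leap_fraction(elapsed):
    """Portion of the extra second absorbed after `elapsed` SI seconds."""
    return min(max(elapsed / SMEAR_WINDOW, 0.0), 1.0)


def smeared_elapsed(elapsed):
    """Seconds the smeared clock has advanced since the window opened."""
    return elapsed - absorbed_leap_fraction(elapsed)
```

Under these assumptions the smeared clock shows exactly 86,400 seconds at the end of the window, so no 23:59:60 is ever exposed to applications, at the cost of each smeared second running about 11.6 microseconds long.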

This article series revisits the underpinnings of the shift toward microservices, with a critical eye. My favorite bit is the analogy for microservice complexity in part 3.

  Uwe Friedrichsen

Catchpoint is back with their seventh annual SRE report, and you can download the PDF directly without having to register.

  Catchpoint

There are some real gems in here, including my favorite, death by yes.

SRE Weekly Issue #459

In a microservices environment, testing user journeys that span across multiple bounded contexts requires collaboration and a clear delineation of responsibilities.

  Yan Cui

These folks migrated from Fastly to Cloudflare using Terraform. They wrote a Go program to translate their Fastly VCL configurations into an equivalent set of parameters for their Terraform module.

  hatappi1225 — Mercari

This 3-part series does a deep dive on how time and clocks work in distributed data stores.

  Murat

TIL: “Unix time” (seconds since the epoch) does not include leap seconds.

  Kyle Kingsbury
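
A quick way to see this for yourself, using only Python's standard library: the leap second inserted at the end of 2016-12-31 does not show up in the difference between Unix timestamps on either side of it.

```python
from datetime import datetime, timezone

# A leap second (23:59:60 UTC) was inserted at the end of 2016-12-31.
before = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

# Two SI seconds elapsed between these instants, but Unix time counts one.
unix_delta = after.timestamp() - before.timestamp()
print(unix_delta)  # 1.0

# Every Unix day is exactly 86,400 seconds, so midnight divides evenly.
print(int(after.timestamp()) % 86_400)  # 0
```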

This post argues that tech companies should avoid outages like Facebook’s in 2021 by using much more rigorous principles such as those used to build bridges. I’m not so sure about that, but it was an interesting read.

  Davi Ottenheimer

There’s a lot going on beneath the surface in a live video streaming service. Cloudflare walks us through it, including key design decisions like on-the-fly transcoding.

  Kyle Boutette and Jacob Curtis — Cloudflare

DSQL is Amazon’s new serverless PostgreSQL-compatible datastore.

Aurora DSQL is designed to remain available, durable, and strongly consistent even in the face of infrastructure failures and network partitions.

But what about the CAP Theorem? Click through to find out how.

  Marc Brooker

This new installment introduces the next level of resilience, which involves the ability to radically change your approach if the usual adaptation strategies fall short.

  Uwe Friedrichsen

SRE Weekly Issue #458

A message from our sponsor, incident.io:

Ever wonder how Netflix handles incidents at their scale? With incident.io, they’ve built a process that’s smooth, scalable, and keeps everyone on the same page. Tools like Catalog and Workflows make it intuitive for teams to tackle incidents consistently—no matter how big the challenge.

https://incident.io/customers/netflix

We can never see our systems directly, so we rely on “sensors” to understand the state of the system. What if the sensors are broken?

  Lorin Hochstein

Two super insightful observations about the nature of architectural work, well worth revisiting next time you’re making big design decisions.

So, “Two IMO relevant findings regarding architectural work” would probably be a more accurate title. But that would be a lot less catchy title … ;)

  Uwe Friedrichsen

To prevent revalidation stampedes, Cloudflare uses randomness to decide whether to send requests to the origin. Click through to find out how it works.

  Thibault Meunier — Cloudflare
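
The general technique can be sketched as follows: each request independently rolls the dice, and the chance of going to the origin rises as the cached entry nears expiry. This is a hedged illustration only. The exponential curve, the `steepness` parameter, and the function names are assumptions, not Cloudflare's actual formula.

```python
import math
import random


def revalidation_probability(remaining, ttl, steepness=10.0):
    """Chance that one request triggers a refresh; rises toward 1 at expiry."""
    if remaining <= 0:
        return 1.0  # already expired: always go to the origin
    return math.exp(-steepness * remaining / ttl)


def should_revalidate(remaining, ttl, rng=random.random):
    """Decide independently per request, without locks or coordination."""
    return rng() < revalidation_probability(remaining, ttl)
```

Because the decision needs no shared state, a few requests refresh the entry shortly before it expires rather than every request hitting the origin at the moment of expiry.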

Some problems with autoscaling, along with potential solutions.

  John Akkarakaran Jose — DZone

This article provides a detailed overview of the Incremental Migration with the Dual Write strategy, including the necessary steps, considerations, and best practices.

  Deepti Marrivada, Bal Reddy Cherlapally, and Spurthi Jambula — DZone

Trying to build the perfect system that anticipates every future need is often worse than creating a system designed to change quickly.

I’ve experienced this firsthand as well. Even an architecture that was supposed to be static needed to change as requirements evolved.

  Simen A. W. Olsen — Pulumi

Using more reliable clocks with definite precision allows for significant performance improvements in distributed systems, as described in this article.

  Murat

This opinion piece argues that Snapshot Isolation is the “sweet spot” isolation level that is best for most applications.

  Marc Brooker

A production of Tinker Tinker Tinker, LLC