SRE WEEKLY – Page 14 – scalability, availability, incident response, automation

SRE Weekly Issue #461

lex

January 26, 2025

The importance of resilience engineering

Written in 2020 after an AWS outage, this article analyzes dependence on third-party services and the responsibility to understand their reliability.

Uwe Friedrichsen

How we invalidate cache for resource-heavy & long-running requests

When a cache expired, these folks found that their application stampeded the database with expensive queries, so they searched for a solution.

Punit Sethi

The danger of overreaction

When a high-severity incident happens, its associated risks become salient: the incident looms large in our mind, and the fact that it just happened leads us to believe that the risk of a similar incident is very high.

Lorin Hochstein

Managing Trace Volume at monday.com

These folks landed on a hybrid approach using two vendors, allowing them to avoid sending their entire trace volume to an expensive observability vendor.

Jakub Sokół — monday

Adaptive LIFO

Under heavy load, requests are handled in LIFO order to maximize the chance of successfully completing fresh requests.

LIFO = Last In, First Out

Teiva Harsanyi
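As a rough illustration of the idea, here is a minimal sketch of a queue that serves requests in FIFO order under normal load and switches to LIFO once the backlog grows past a threshold. The class name `AdaptiveQueue` and the `lifo_threshold` parameter are hypothetical, chosen for this example rather than taken from the linked article.

```python
from collections import deque


class AdaptiveQueue:
    """FIFO under normal load, LIFO once the backlog exceeds a threshold."""

    def __init__(self, lifo_threshold):
        # lifo_threshold is a hypothetical tuning knob, not from the article.
        self.lifo_threshold = lifo_threshold
        self._items = deque()

    def put(self, request):
        # New requests always join the right-hand end of the deque.
        self._items.append(request)

    def get(self):
        if not self._items:
            raise IndexError("queue is empty")
        if len(self._items) > self.lifo_threshold:
            # Overloaded: serve the newest request first, since older
            # callers are more likely to have timed out already.
            return self._items.pop()
        # Normal load: serve the oldest request first.
        return self._items.popleft()
```

With a threshold of 2 and four queued requests, the two newest are served first, after which the queue falls back to oldest-first order.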

Kafka vs NATS: A Comparison for Message Processing

More than just a simple feature comparison, this article also presents two use cases and analyzes which tool is best in each case.

Josson Paul Kalapparambath — DZone

Go All the Way: Why Golang is Your Swiss Army Knife for Modern Development

These folks explain why they use Go for everything: application code, infrastructure as code, tooling, and even as a wrapper around Helm charts for Kubernetes.

Akhilesh Krishnan — Oodle AI

SRE Weekly Issue #460

lex

January 19, 2025

An Incident Review of an Incident Review

So I bombed an incident review this week. More specifically, the facilitating.

I love how candid this article is. This kind of story is invaluable to level up our own retrospective facilitation skills.

Will Gallego

How we built observability with Google Cloud services

It turns out that Google Cloud has a distributed tracing offering, and here’s an example of how to set it up.

Punit Sethi

Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases

This article explains how 8 popular database systems use synchronized clocks. The systems covered include Spanner, DynamoDB, CockroachDB, and others.

Murat

How to Handle Hot Shard Problem?

This article introduces the concept of a hot shard in a distributed system and outlines several strategies for alleviating it.

Sid

Pushing the whole company into the past on purpose

Leap seconds can be really dangerous for IT systems! This article explains how the author eased their infrastructure through a leap second by smearing its effect across the preceding day.

rachelbythebay
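To make the smearing idea concrete, here is a minimal sketch that spreads a one-second adjustment evenly across a window. The linear curve, the 86,400-second window, and the function name `smear_offset` are assumptions for illustration; the linked article may use a different curve or tooling.

```python
SMEAR_WINDOW_SECONDS = 86_400  # assumed: smear across one full day


def smear_offset(seconds_into_window, window=SMEAR_WINDOW_SECONDS):
    """Return how much of the 1-second leap has been absorbed so far.

    The result grows linearly from 0.0 at the start of the window to 1.0
    at the end, so no single moment sees a sudden one-second jump.
    """
    clamped = min(max(seconds_into_window, 0), window)
    return clamped / window
```

Under these assumptions, each second of the window absorbs 1/86,400 of a second (about 11.6 microseconds), which is small enough that most software never notices the adjustment.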

The microservices fallacy

This article series revisits the underpinnings of the shift toward microservices, with a critical eye. My favorite bit is the analogy for microservice complexity in part 3.

Uwe Friedrichsen

The SRE Report 2025

Catchpoint is back with their seventh annual SRE report, and you can download the PDF directly without having to register.

Catchpoint

r/sre: What’s the most bizarre root cause you’ve ever seen?

There are some real gems in here, including my favorite, death by yes.

SRE Weekly Issue #459

lex

January 12, 2025

How to end-to-end test microservices across bounded contexts?

In a microservices environment, testing user journeys that span multiple bounded contexts requires collaboration and a clear delineation of responsibilities.

Yan Cui

A smooth CDN provider migration and future initiatives

These folks migrated from Fastly to Cloudflare using Terraform. They wrote a Go program to translate their Fastly VCL configurations into an equivalent set of parameters for their Terraform module.

hatappi1225 — Mercari

Use of Time in Distributed Databases (part 1)

This 3-part series does a deep dive on how time and clocks work in distributed data stores. Parts 2 and 3 are also available.

Murat

Seconds Since the Epoch

TIL: “Unix time” (seconds since the epoch) does not include leap seconds.

Kyle Kingsbury
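A quick way to see this is to compute Unix timestamps on either side of a real leap second. The sketch below uses the leap second inserted at the end of 2016-12-31 (23:59:60 UTC) and Python's standard `calendar.timegm`; the variable names are just for illustration.

```python
import calendar

# A leap second (23:59:60 UTC) was inserted at the end of 2016-12-31.
before = calendar.timegm((2016, 12, 31, 23, 59, 59))
after = calendar.timegm((2017, 1, 1, 0, 0, 0))

# Two real seconds elapsed between these instants, but Unix time counts one.
elapsed_unix = after - before

# Every UTC midnight lands on an exact multiple of 86,400, which only works
# because leap seconds are left out of the count.
midnight_remainder = after % 86_400
```

Here `elapsed_unix` is 1 and `midnight_remainder` is 0, even though the wall clock ticked through an extra second in between.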

Facebook Engineering Disasters Are Not Inevitable: Moving Past Casual Commentary to Real Change

This post argues that tech companies should avoid outages like Facebook’s in 2021 by using much more rigorous principles such as those used to build bridges. I’m not so sure about that, but it was an interesting read.

Davi Ottenheimer

Behind the scenes with Stream Live, Cloudflare’s live streaming service

There’s a lot going on beneath the surface in a live video streaming service. Cloudflare walks us through it, including key design decisions like on-the-fly transcoding.

Kyle Boutette and Jacob Curtis — Cloudflare

DSQL Vignette: Wait! Isn’t That Impossible?

DSQL is Amazon’s new serverless PostgreSQL-compatible datastore.

Aurora DSQL is designed to remain available, durable, and strongly consistent even in the face of infrastructure failures and network partitions.

But what about the CAP Theorem? Click through to find out how.

Marc Brooker

The long way towards resilience

This new installment introduces the next level of resilience, which involves the ability to radically change your approach if the usual adaptation strategies fall short.

Uwe Friedrichsen

SRE Weekly Issue #458

lex

January 5, 2025

Your lying virtual eyes

We can never see our systems directly, so we rely on “sensors” to understand the state of the system. What if the sensors are broken?

Lorin Hochstein

The laws of architectural work

Two super insightful observations about the nature of architectural work, well worth revisiting next time you’re making big design decisions.

So, “Two IMO relevant findings regarding architectural work” would probably be a more accurate title. But that would be a much less catchy title … ;)

Uwe Friedrichsen

Sometimes I cache: implementing lock-free probabilistic caching

To prevent revalidation stampedes, Cloudflare uses randomness to decide whether to send requests to the origin. Click through to find out how it works.

Thibault Meunier — Cloudflare
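For a feel of how randomness can replace locking here, the following is a minimal sketch of probabilistic revalidation: each request independently rolls the dice, and the chance of going to the origin rises as the cached entry ages. The function name `should_revalidate`, the linear probability curve, and the `rng` parameter are assumptions made for this example, not Cloudflare's actual algorithm.

```python
import random


def should_revalidate(age, ttl, rng=random.random):
    """Decide, per request, whether to refresh a cached entry from the origin.

    The chance of revalidating grows linearly from 0 (fresh entry) to 1
    (entry at or past its TTL). The linear curve is an assumption.
    """
    if ttl <= 0 or age >= ttl:
        # Expired (or uncacheable) entries always go to the origin.
        return True
    if age <= 0:
        # A brand-new entry is always served from cache.
        return False
    # No lock or shared state: each request decides on its own.
    return rng() < age / ttl
```

Because only a small fraction of requests revalidate while the entry is still fresh, the refresh tends to happen before expiry, and requests are less likely to pile onto the origin at the same moment. Passing a fixed `rng` (for example `lambda: 0.4`) makes the decision deterministic for testing.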

Why EC2 Autoscaling Isn’t a Silver Bullet

Some problems with autoscaling, along with potential solutions.

John Akkarakaran Jose — DZone

Migration from RDS to DynamoDB With the Dual Write Strategy

This article provides a detailed overview of the Incremental Migration with the Dual Write strategy, including the necessary steps, considerations, and best practices.

Deepti Marrivada, Bal Reddy Cherlapally, and Spurthi Jambula — DZone

Your Perfect Infrastructure May Not Be So Perfect

Trying to build the perfect system that anticipates every future need is often worse than creating a system designed to change quickly.

I’ve experienced this firsthand as well. Even an architecture that was supposed to be static needed to change as requirements evolved.

Simen A. W. Olsen — Pulumi

Utilizing highly synchronized clocks in distributed databases

Using more reliable clocks with definite precision allows for significant performance improvements in distributed systems, as described in this article.

Murat

Snapshot Isolation vs Serializability

This opinion piece argues that Snapshot Isolation is the “sweet spot” isolation level that is best for most applications.

Marc Brooker

SRE Weekly Issue #457

lex

December 29, 2024

Preventing Out-of-Memory (OOM) Kills in Kubernetes: Tips for Optimizing Container Memory Management

In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.

Will Searle — Causely

The long way towards resilience – Part 7

The high plateau of basic resilience is the third interim stop companies tend to reach on their journey towards resilience.

I especially enjoyed the bit about how trying to add robustness can paradoxically diminish overall reliability, reminiscent of Lorin Hochstein and others.

Uwe Friedrichsen

Adding latency: one step, two step, oops

What happens when you move your DB and network latency goes from 0.5ms to 10ms? Time to find out by experimenting (carefully).

Lawrence Jones

How to support a growing Kubernetes cluster with a small etcd

I’ve only used Kubernetes under Amazon EKS, which handles running etcd, so this guide helped fill in some gaps in my knowledge. Of course, under EKS, you still need to pay attention to etcd.

David M. Lentz — Datadog

The Evolution of SRE at Google

Google folks share how they’ve applied System-Theoretic Accident Model and Processes (STAMP) to SRE at Google. This really stood out to me:

A design might implement its requirements flawlessly. But what if requirements necessary for the system to be safe were incorrect or, even worse, missing altogether?

Tim Falzone and Ben Treynor Sloss — USENIX ;login:

New Blog Series: RescueOps

Search and rescue (SAR) operations and incident response have striking similarities. In this series, Claire dives into lessons SREs can learn from the Incident Command System (ICS) as used in wildfire management.

I really love learning about ICS from the veterans who use it for actual emergencies!

Claire Leverne — Rootly

The loneliness of the long distance runbook

Runbooks are programs for an imperfect execution engine of highly variable quality.

What happens when the runbook meets reality?

Jos Visser

Canva incident report: API Gateway outage

This is a really great one! Several factors combined to cause the outage, and they’re all laid out in juicy detail.

Brendan Humphreys — Canva

The Canva outage: another tale of saturation and resilience

Here’s Lorin Hochstein’s take on Canva’s outage report.

Lorin Hochstein
