SRE WEEKLY – Page 6 – scalability, availability, incident response, automation

SRE Weekly Issue #458

lex

January 5, 2025

We can never see our systems directly, so we rely on “sensors” to understand the state of the system. What if the sensors are broken?

Lorin Hochstein

The laws of architectural work

Two super insightful observations about the nature of architectural work, well worth revisiting next time you’re making big design decisions.

So, “Two IMO relevant findings regarding architectural work” would probably be a more accurate title. But that would be a lot less catchy title … ;)

Uwe Friedrichsen

Sometimes I cache: implementing lock-free probabilistic caching

To prevent revalidation stampedes, Cloudflare uses randomness to decide whether to send requests to the origin. Click through to find out how it works.

Thibault Meunier — Cloudflare

Why EC2 Autoscaling Isn’t a Silver Bullet

Some problems with autoscaling, along with potential solutions.

John Akkarakaran Jose — DZone

Migration from RDS to DynamoDB With the Dual Write Strategy

This article provides a detailed overview of the Incremental Migration with the Dual Write strategy, including the necessary steps, considerations, and best practices.

Deepti Marrivada, Bal Reddy Cherlapally, and Spurthi Jambula — DZone

Your Perfect Infrastructure May Not Be So Perfect

trying to build the perfect system that anticipates every future need is often worse than creating a system designed to change quickly.

I’ve experienced this firsthand as well. Even an architecture that was supposed to be static needed to change as requirements evolved.

Simen A. W. Olsen — Pulumi

Utilizing highly synchronized clocks in distributed databases

Using more reliable clocks with definite precision allows for significant performance improvements in distributed systems, as described in this article.

Murat

Snapshot Isolation vs Serializability

This opinion piece argues that Snapshot Isolation is the “sweet spot” isolation level that is best for most applications.

Marc Brooker

SRE Weekly Issue #457

lex

December 29, 2024

General

Comments

View on sreweekly.com

Preventing Out-of-Memory (OOM) Kills in Kubernetes: Tips for Optimizing Container Memory Management

In this post, we’ll explore the reasons that OOM kills can occur and provide tactics to combat and prevent them.

Will Searle — Causely

The long way towards resilience – Part 7

The high-plateau of basic resilience is the third interim stop, companies tend to reach on their journey towards resilience.

I especially enjoyed the bit about how trying to add robustness can paradoxically diminish overall reliability, reminiscent of Lorin Hochstein and others.

Uwe Friedrichsen

Adding latency: one step, two step, oops

What happens when you move your DB and network latency goes from 0.5ms to 10ms? Time to find out by experimenting (carefully).

Lawrence Jones

How to support a growing Kubernetes cluster with a small etcd

I’ve only used Kubernetes under Amazon EKS, which handles running etcd, so this guide helped fill in some gaps in my knowledge. Of course, under EKS, you still need to pay attention to etcd.

David M. Lentz — Datadog

The Evolution of SRE at Google

Google folks share how they’ve applied System-Theoretic Accident Model and Processes (STAMP) to SRE at Google. This really stood out to me:

A design might implement its requirements flawlessly. But what if requirements necessary for the system to be safe were incorrect or, even worse, missing altogether?

Tim Falzone and Ben Treynor Sloss — USENIX ;login:

New Blog Series: RescueOps

Search and rescue (SAR) operations and incident response have striking similarities. In this series, Claire dives into lessons SREs can learn from wildfire management ICSs.

I really love learning about ICS from the veterans who use it for actual emergencies!

Claire Leverne — Rootly

The loneliness of the long distance runbook

Runbooks are programs for an imperfect execution engine of highly variable quality.

What happens when the runbook meets reality?

Jos Visser

Canva incident report: API Gateway outage

This is a really great one! Several factors combined to cause the outage, and they’re all laid out in juicy detail.

Brendan Humphreys — Canva

The Canva outage: another tale of saturation and resilience

Here’s Lorin Hochstein’s take on Canva’s outage report.

Lorin Hochstein

SRE Weekly Issue #456

lex

December 22, 2024

General

Comments

View on sreweekly.com

MTTR: When sample means and power laws combine, trouble follows

Here’s another way to use math to show that tracking MTTR over time is going to help you draw incorrect conclusions about your incident trends.

Lorin Hochstein

How Dropbox Saved Millions of Dollars by Building a Load Balancer

Why build your own? Dropbox had a heterogeneous fleet with differently-sized backends, and no load-balancer available at the time could handle that.

Richard Oliver Bray

The Incident Maturity Model

There’s so much here, I need to read it again a few times — and you should too. Their model has three stages of increasing maturity, allowing you to adopt it at the right pace for your org.

Stephen Whitworth — incident.io

Break Stuff on Purpose

After accidentally losing all of their Kibana dashboards, the folks at Slack implemented chaos engineering to detect similar problems early.

Sean Madden — Slack

LLMs won’t save us

This article raises concerns about using LLMs in production operations that I haven’t seen expressed quite in this way before.

Niall Murphy

New Production Readiness Check experience in Mercari

Five years ago, Mercari adopted a checklist for production readiness, and they’ve seen reliability improve as a result. Now they’re sharing how adoption has gone and the impact it’s had on development teams and what they’re doing about it.

mshibuya — Mercari

Google Cloud incident report: Bigquery incident on December 4, 2024

They deleted an internal project that held API keys that were still in use.

Google

Uptime, status pages, and transparency calculus

A status page can be about so much more than just informing customers of downtime. It’s a marketing artifact, evidence for SLA breach, a sales pitch, and more.

Lawrence Jones

SRE Weekly Issue #455

lex

December 15, 2024

General

Comments

View on sreweekly.com

How to Handle Sudden Bursts of Traffic or “Thundering Herd Problem”?

This article has 6 methods to mitigate thundering herd problems, including pretty diagrams with each.

Sid

The“Second Victim” and Beatitude

Some thoughts on the “second victim” concept. As a note, I was one of the participants in the discussion on which this article is based.

Fractal Flame

Building on Shaky Ground

Written in response to a question about the big CrowdStrike outage earlier this year, this article asks: do we need to start using safer languages?

Kode Vicious — ACM Queue

How we seamlessly migrated high volume real-time streaming traffic from one service to another with zero data loss and duplication

This one used a cool technique I haven’t seen yet: they hardcoded a cutoff time into the old and new systems, so they both automatically cut over simultaneously.

Md Riyadh, Jia Long Loh, Muqi Li, and Pu Li — Grab

The flight plan that brought UK airspace to its knees

Here’s a great writeup of a problem with the UK flight system involving a latent bug. Among several cool takeaways, I really liked the way the official incident report didn’t try to pretend this weird bug could have been foreseen and prevented.

Chris Evans — incident.io

When Game Days go wrong

This game day ended up way more serious than intended and exposed a serious Kubernetes configuration flaw, causing a real outage. Oops!

Lawrence Jones

How using Availability Zones can eat up your budget — our journey from Prometheus to VictoriaMetrics

It’s all fun and games until someone accidentally uses too much DTAZ (data transfer between availability zones). Good monitoring story, too!

Grzegorz Skołyszewski — Prezi

API, ChatGPT & Sora Facing Issues

OpenAI posted this writeup of an incident earlier this week. They tried to deploy detailed monitoring for their Kubernetes cluster, but the monitoring system overloaded the Kubernetes API.

OpenAI

Quick takes on the recent OpenAI public incident write-up

And here’s Lorin Hochstein’s analysis of OpenAI’s incident writeup, including a recurring theme:

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability.

Lorin Hochstein

SRE Weekly Issue #454

lex

December 8, 2024

General

Comments

View on sreweekly.com

Nine entire years ago, I threw together a few “issues” with my favorite SRE articles, installed WordPress, and added a subscription form, with no clue what I was doing. It’s only thanks to you folks, the thousands of subscribers and the many authors of great SRE content, that I’ve been able to keep this up for so long. Thank you, you make it fun! And as always, thanks also to my sponsors, former, current, and future, who’ve helped make this whole thing possible.

TTR: the out-of-control metric

When we try to optimize MTTR as if it’s a meaningful statistic, we run into trouble. This article does a great job of explaining why, drawing from concepts and techniques in manufacturing.

Lorin Hochstein

The Case for Shared Storage

This article introduces the concepts of “shared nothing” and “shared storage” in distributed systems and then explains why they chose shared storage for WarpStream.

Richard Artoul — WarpStream

Do Dollars Make Sense for Incident Management?

How much did that incident cost in lost revenue? This article says you should avoid including that number in your incident management process, because it’s a trap.

Tom Webster — Rootly

Breaking down CPU speed: How utilization impacts performance

Pushing a system to 100% CPU utilization can cause workloads to be slowed down. This article is about experimentally finding the sweet spot between utilizing CPUs as much as possible and avoiding performance issues.

Andreas Strikos — GitHub

Solutions to the Lost Update Problem

This article has a couple of strategies for handling concurrent updates to the same row in MySQL, with and without locking.

Sönke Ruempler

How we page ourselves if incident.io goes down

They do it with a dead man’s switch, implemented using a backup alert provider.

Lawrence Jones — incident.io

The long way towards resilience – Part 6

I came across part 6 first and I need to go back and read the rest, but I just had to share this now, because if the cool concept it contains: that efficiency and resiliency are at odds with each other.

Uwe Friedrichsen

Keeping User Journey SLOs Up-to-Date with E2E Testing in a Microservices Architecture

This is so cool! Their system automatically figures out which API calls are critical to each user journey and keeps the list updated.

yakenji — Mercari

SRE Weekly Issue #458

SRE Weekly Issue #457

SRE Weekly Issue #456

SRE Weekly Issue #455

SRE Weekly Issue #454

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, incident.io:

A message from our sponsor, FireHydrant:

A message from our sponsor, FireHydrant:

A message from our sponsor, FireHydrant:

A message from our sponsor, FireHydrant:

Subscribe

RSS

Mastodon

Search Issues