SRE Weekly Issue #455

How to Handle Sudden Bursts of Traffic or “Thundering Herd Problem”?

This article covers 6 methods for mitigating thundering herd problems, each illustrated with a diagram.

Sid

Some thoughts on the “second victim” concept. As a note, I was one of the participants in the discussion on which this article is based.

Fractal Flame

Building on Shaky Ground

Written in response to a question about the big CrowdStrike outage earlier this year, this article asks: do we need to start using safer languages?

Kode Vicious — ACM Queue

How we seamlessly migrated high volume real-time streaming traffic from one service to another with zero data loss and duplication

This one used a cool technique I haven’t seen yet: they hardcoded a cutoff time into the old and new systems, so they both automatically cut over simultaneously.

Md Riyadh, Jia Long Loh, Muqi Li, and Pu Li — Grab
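
For anyone curious what a hardcoded cutoff might look like, here is a minimal sketch, not Grab's actual implementation. It assumes each event carries a timezone-aware timestamp and that both services ship with the same constant; the names `CUTOFF`, `old_system_should_process`, and `new_system_should_process`, as well as the date itself, are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical cutoff, compiled into BOTH the old and the new service.
# The date is a placeholder, not a value from the article.
CUTOFF = datetime(2024, 6, 1, 0, 0, 0, tzinfo=timezone.utc)


def old_system_should_process(event_time: datetime) -> bool:
    """Old service handles only events stamped strictly before the cutoff."""
    return event_time < CUTOFF


def new_system_should_process(event_time: datetime) -> bool:
    """New service handles events stamped at or after the cutoff."""
    return event_time >= CUTOFF
```

Because the two checks are exact complements, every event is claimed by exactly one service, which is how this kind of scheme can avoid both loss and duplication. Deciding on the event's own timestamp rather than each host's wall clock is an assumption of this sketch; it keeps clock skew between hosts from affecting the split.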

The flight plan that brought UK airspace to its knees

Here’s a great writeup of a problem with the UK flight system involving a latent bug. Among several cool takeaways, I really liked the way the official incident report didn’t try to pretend this weird bug could have been foreseen and prevented.

Chris Evans — incident.io

When Game Days go wrong

This game day ended up far more serious than intended: it exposed a Kubernetes configuration flaw and caused a real outage. Oops!

Lawrence Jones

How using Availability Zones can eat up your budget — our journey from Prometheus to VictoriaMetrics

It’s all fun and games until someone accidentally uses too much DTAZ (data transfer between availability zones). Good monitoring story, too!

Grzegorz Skołyszewski — Prezi

API, ChatGPT & Sora Facing Issues

OpenAI posted this writeup of an incident earlier this week. They tried to deploy detailed monitoring for their Kubernetes cluster, but the monitoring system overloaded the Kubernetes API.

OpenAI

Quick takes on the recent OpenAI public incident write-up

And here’s Lorin Hochstein’s analysis of OpenAI’s incident writeup, including a recurring theme:

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability.

Lorin Hochstein
