SRE Weekly Issue #499

The folks at Uptime Labs and Adaptive Capacity Labs have announced an advent calendar for this December.

Note: To take part, you’ll need to provide an email address to subscribe. I gave that some serious thought before including this here, but ultimately, I have a lot of trust in the folks at both ACL and Uptime Labs, since they’ve both produced so much awesome content that’s been featured here. I’m interested to see what this collab will bring!

Uptime Labs and Adaptive Capacity Labs

From Static Rate Limiting to Adaptive Traffic Management in Airbnb’s Key-Value Store

Cool trick: divide the short-term P95 latency by the long-term P95 to detect load spikes and adjust rate limits on the fly.

Shravan Gaonkar — Airbnb
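
For a rough feel of the ratio trick in the blurb above, here is a minimal sketch. It is not Airbnb's implementation: the class name, window sizes, spike threshold, and the rule for shrinking the limit are all assumptions made up for illustration.

```python
from collections import deque


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


class AdaptiveLimiter:
    """Hypothetical limiter: compares recent P95 latency to a longer baseline."""

    def __init__(self, base_limit, short_window=50, long_window=1000,
                 spike_ratio=1.5, min_limit=1):
        self.base_limit = base_limit
        self.spike_ratio = spike_ratio  # assumed threshold, not from the article
        self.min_limit = min_limit
        self.short = deque(maxlen=short_window)  # recent samples only
        self.long = deque(maxlen=long_window)    # longer-term baseline

    def record(self, latency_ms):
        self.short.append(latency_ms)
        self.long.append(latency_ms)

    def ratio(self):
        """Short-term P95 divided by long-term P95 (1.0 when there is no data)."""
        if not self.long:
            return 1.0
        baseline = p95(self.long)
        if baseline <= 0:
            return 1.0
        return p95(self.short) / baseline

    def current_limit(self):
        """Keep the base limit until the ratio signals a spike, then scale down."""
        r = self.ratio()
        if r <= self.spike_ratio:
            return self.base_limit
        # Assumed policy: shrink the limit in proportion to the spike.
        return max(self.min_limit, int(self.base_limit / r))


limiter = AdaptiveLimiter(base_limit=1000, short_window=10, long_window=100)
for _ in range(100):
    limiter.record(10.0)       # steady state
print(limiter.current_limit())
for _ in range(4):
    limiter.record(100.0)      # a burst of slow requests
print(limiter.ratio(), limiter.current_limit())
```

The short window reacts to the burst almost immediately while the long window barely moves, so the ratio jumps well above 1 and the limit tightens. Once latency settles, the ratio drifts back toward 1 and the base limit returns.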

Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog

Datadog shares the bigger-picture lessons they learned and improvements they made since their major 2023 outage, including an emphasis on graceful degradation.

Laura de Vesine, Rob Thomas, and Maciej Kowalewski

Why we’re leaving serverless

This article does a really good job of laying out the problems with serverless that led them to leave: having to layer on significant complexity to deal with the limits of running in Cloudflare Workers.

Andreas Thomas — Unkey

Reliability and Fault Tolerance

This article explains the two concepts of reliability and fault tolerance and how they relate.

Oakley Hall

r/sre: Today I caused a production incident with a stupid bug

This one could easily be titled, “Today, major systemic failures meant that I was able to take down production really easily.” There’s some great discussion in the comments, and I hope the author feels better.

u/Deep-Jellyfish-2383 and others — reddit

Advancing Our Chef Infrastructure: Safety Without Disruption

Slack shows how they reworked their monolithic Chef cookbook deployment process to reduce risk by splitting production into 6 separate environments.

Archie Gunasekara — Slack

You’ll never see attrition referenced in an RCA

The author discusses reasons why engineer attrition won’t appear in a public incident write-up, and may well not appear in a private one, either.

Lorin Hochstein
