SRE Weekly Issue #499

The folks at Uptime Labs and Adaptive Capacity Labs have announced an advent calendar for this December.

Note: In order to take part, you’ll need to provide an email address to subscribe. I gave that some serious thought before including this here, but ultimately, I have a lot of trust in the folks at both ACL and Uptime Labs, since they’ve both produced so much awesome content that’s been featured here. I’m interested to see what this collab will bring!

  Uptime Labs and Adaptive Capacity Labs

Cool trick: divide short-term P95 latency by the long-term P95 to detect load spikes and adjust rate limits on the fly.

  Shravan Gaonkar — Airbnb
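
To make the ratio idea concrete, here is a minimal Python sketch. It is not the article's implementation: the class name `AdaptiveLimiter`, the window sizes, the `spike_ratio` threshold, and the rule of dividing the base limit by the ratio are all assumptions chosen for illustration.

```python
from collections import deque


def p95(samples):
    """Return the 95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Integer arithmetic for ceil(0.95 * n), avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


class AdaptiveLimiter:
    """Hypothetical limiter that shrinks its limit when latency spikes."""

    def __init__(self, base_limit, short_window=20, long_window=200,
                 spike_ratio=1.5, min_limit=1):
        self.base_limit = base_limit
        self.short = deque(maxlen=short_window)  # recent samples
        self.long = deque(maxlen=long_window)    # baseline samples
        self.spike_ratio = spike_ratio           # assumed threshold
        self.min_limit = min_limit

    def record(self, latency_ms):
        """Add one latency observation to both windows."""
        self.short.append(latency_ms)
        self.long.append(latency_ms)

    def ratio(self):
        """Short-term P95 divided by long-term P95 (1.0 if no data)."""
        if not self.long:
            return 1.0
        baseline = p95(self.long)
        if baseline <= 0:
            return 1.0
        return p95(self.short) / baseline

    def current_limit(self):
        """Keep the base limit unless the ratio signals a spike."""
        r = self.ratio()
        if r <= self.spike_ratio:
            return self.base_limit
        return max(self.min_limit, int(self.base_limit / r))


limiter = AdaptiveLimiter(base_limit=100)
for _ in range(200):
    limiter.record(10.0)       # steady state
for _ in range(5):
    limiter.record(100.0)      # sudden slow requests
print(limiter.ratio(), limiter.current_limit())
```

The short window reacts to a handful of slow requests while the long window still reflects the steady state, so the ratio jumps well above 1 during a spike. A real system would likely smooth the result and window by time instead of by sample count.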

Datadog shares the bigger-picture lessons they learned and improvements they made since their major 2023 outage, including an emphasis on graceful degradation.

  Laura de Vesine, Rob Thomas, and Maciej Kowalewski

This article does a really good job of laying out the problems with serverless that led Unkey to leave: they had to layer on significant complexity to deal with the limits of running in Cloudflare Workers.

  Andreas Thomas — Unkey

This article explains the two concepts of reliability and fault tolerance and how they relate.

  Oakley Hall

This one could easily be titled, “Today, major system failures meant that I was able to take down production really easily.” There’s some great discussion in the comments, and I hope the author feels better.

  u/Deep-Jellyfish-2383 and others — reddit

Slack shows how they changed their monolithic process for deploying Chef cookbook changes to reduce risk, breaking production up into 6 separate environments.

  Archie Gunasekara — Slack

The author discusses reasons why engineer attrition won’t appear in a public incident write-up, and may well not appear in a private one, either.

  Lorin Hochstein

Updated: November 30, 2025 — 9:56 pm
A production of Tinker Tinker Tinker, LLC