SRE Weekly Issue #463

A message from our sponsor, incident.io:

Incidents move fast—so should your response. That’s why we’re building an AI responder that thinks like your team, not a machine. See how we’re doing it, the challenges faced, and what else is on the AI roadmap.

https://www.youtube.com/watch?v=rNpwZPOUhuE

Sometimes, we can harness randomness to improve throughput and reliability.

  Teiva Harsanyi — The Coder Cafe

Not just the “how”, but also the “why”, along with the challenges they found along the way.

  Daniel Paulus and Umut Uzgur — Checkly

It’s a classic problem: how do you detect problems that badly impact a specific set of customers, when the overall percentage affected is tiny?

  Lakshmi Narayan and Joshua Delman — Stripe

This is the clearest and most concise explanation of the Byzantine Generals Problem that I’ve read.

  Sid — The Scalable Thread

Th[is] article describes some different methods and tools that engineers can use to simulate their clusters and what knowledge they can gain from it, and it presents a case study using SimKube, the Kubernetes simulator developed by Applied Computing Research Labs in 2024.

  David R. Morrison — ACM Queue

An IaC nightmare: when a list of IPs became empty, the IP block rule was suddenly interpreted as “block everything” rather than “block nothing”.

  Jake Cooper — Railway
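
To make that failure mode concrete, here is a minimal sketch of how an empty list can flip a rule's meaning. It is not Railway's actual configuration or tooling: the rule format, the function names (render_rule_naive, render_rule_safe, is_denied), and the "empty means match all" default are all assumptions made up for illustration.

```python
from ipaddress import ip_address, ip_network

MATCH_ALL = "0.0.0.0/0"  # assumed "no filter" default, for illustration only


def render_rule_naive(blocked_cidrs):
    # Hypothetical bug: an empty source list is treated as "no filter",
    # so the deny rule ends up matching all IPv4 traffic.
    sources = blocked_cidrs or [MATCH_ALL]
    return {"action": "deny", "sources": sources}


def render_rule_safe(blocked_cidrs):
    # An empty list means there is nothing to block, so emit no rule at all.
    if not blocked_cidrs:
        return None
    return {"action": "deny", "sources": list(blocked_cidrs)}


def is_denied(rule, client_ip):
    # A missing rule denies nobody; otherwise check each CIDR in the rule.
    if rule is None:
        return False
    addr = ip_address(client_ip)
    return any(addr in ip_network(cidr) for cidr in rule["sources"])


if __name__ == "__main__":
    client = "198.51.100.7"
    print(is_denied(render_rule_naive([]), client))  # naive rule blocks everyone
    print(is_denied(render_rule_safe([]), client))   # safe rule blocks no one
```

The design choice in the safe variant is to treat "empty" as an explicit case rather than letting a default fill it in, which is one common way to add the kind of validation this class of incident calls for.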

The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2.

  Matt Silverlock and Javier Castro — Cloudflare

Along with being blatantly illegal, DOGE’s actions are incredibly risky from a reliability perspective. Thanks, Liz, for putting into words concerns that I also share.

  Liz Fong-Jones — Bulletin of the Atomic Scientists

Updated: February 9, 2025 — 9:17 pm
A production of Tinker Tinker Tinker, LLC