SRE Weekly Issue #463

Sometimes, we can harness randomness to improve throughput and reliability.

Teiva Harsanyi — The Coder Cafe

How We Migrated Checkly From Heroku to AWS

Not just the “how”, but also the “why”, along with the challenges they found along the way.

Daniel Paulus and Umut Uzgur — Checkly

Using ML to detect and respond to performance degradations in slices of Stripe payments

It’s a classic problem: how do you detect problems that badly impact a specific set of customers, when the overall percentage affected is tiny?

Lakshmi Narayan and Joshua Delman — Stripe

What is the Byzantine Generals Problem in Distributed Systems?

This is the clearest and most concise explanation of the Byzantine Generals Problem that I’ve read.

Sid — The Scalable Thread

Simulation: An Underutilized Tool in Distributed Systems

Th[is] article describes some different methods and tools that engineers can use to simulate their clusters and what knowledge they can gain from it, and it presents a case study using SimKube, the Kubernetes simulator developed by Applied Computing Research Labs in 2024.

David R. Morrison — ACM Queue

Incident Report: December 16th, 2024

An IaC nightmare: when a list went from having IPs to being empty, the IP block rule was suddenly interpreted as “block everything” rather than “block nothing”.

Jake Cooper — Railway
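
To make the failure mode in the Railway item above concrete, here is a minimal sketch of how an empty list can flip a rule's meaning. It is not Railway's actual configuration or code: the function names `is_blocked_buggy` and `is_blocked_safe` and the sample addresses are hypothetical and chosen only for illustration.

```python
from ipaddress import ip_address, ip_network


def is_blocked_buggy(src: str, blocked_cidrs: list[str]) -> bool:
    """Hypothetical rule evaluator with the dangerous default.

    An empty list is read as "no filter on this rule", so the block
    rule matches every source address.
    """
    if not blocked_cidrs:
        return True  # empty list treated as "match all" -> block everything
    return any(ip_address(src) in ip_network(cidr) for cidr in blocked_cidrs)


def is_blocked_safe(src: str, blocked_cidrs: list[str]) -> bool:
    """Hypothetical evaluator where an empty list means "block nothing".

    any() over an empty iterable is False, so no special case is needed.
    """
    return any(ip_address(src) in ip_network(cidr) for cidr in blocked_cidrs)


if __name__ == "__main__":
    client = "203.0.113.7"  # documentation-range address, illustrative only
    print(is_blocked_buggy(client, []))  # True: everything is blocked
    print(is_blocked_safe(client, []))   # False: nothing is blocked
```

Both functions agree whenever the list is non-empty; they differ only in the empty case, which is why a change like this can pass review and still take down traffic.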

Cloudflare incident on February 6, 2025

The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2.

Matt Silverlock and Javier Castro — Cloudflare

Why DOGE’s meddling at Treasury could have catastrophic consequences for the US economy

Along with being blatantly illegal, DOGE’s actions are incredibly risky from a reliability perspective. Thanks, Liz, for putting into words concerns that I also share.

Liz Fong-Jones — Bulletin of the Atomic Scientists
