[…] the fix isn’t “train your engineers to write better status updates.” The fix is to stop asking your engineers to write them, and start asking the right people instead.
Brent Chapman
A satisfying scaling story where every fix came from looking more closely at the system — Kafka head-of-line blocking, a clumpy scheduler, and an active-active API that silently doubled latency for half of all partitions.
Dave Baxter — Cloudflare
Some good examples of risks in here, along with an interesting tendency to blame “user error”.
Prakshal Doshi — HackerNoon
Satellites present unique reliability constraints like limited data uplink windows and the risk of bricking a very expensive piece of equipment.
Author:
This looks fun! It’s a free virtual event on July 8.
Uptime Labs
This article does a really great job of building up an explanation of feedback-based control and the difference between edge-triggered and level-triggered systems.
Fatih Arslan — PlanetScale
An open letter asking software researchers to study incident response in software systems. It’s so cool how the author uses examples to translate incident response concepts for researchers who may not be familiar with them.
Lorin Hochstein
An important concept: a user’s perception of your average outage duration is weighted and won’t match a flat average MTTR.
Marc Brooker
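To see why the two numbers can diverge, here is a minimal sketch. It assumes the weighting is by duration (a user who runs into an outage at a random moment is more likely to land in a long one) and uses made-up outage lengths. The function names and data are illustrative only and are not taken from the linked article.

```python
def flat_mean(durations):
    """Plain average outage duration: every outage counts equally."""
    return sum(durations) / len(durations)


def duration_weighted_mean(durations):
    """Average outage duration where each outage is weighted by its own
    length. Assumption: this models a user who hits an outage at a random
    point in time, so longer outages are proportionally more likely."""
    return sum(d * d for d in durations) / sum(durations)


# Hypothetical outage lengths in minutes: four short blips and one long outage.
outages = [5, 5, 5, 5, 100]

print(f"flat mean:              {flat_mean(outages):.1f} min")
print(f"duration-weighted mean: {duration_weighted_mean(outages):.1f} min")
```

With these made-up numbers, the single long outage dominates the weighted figure even though most outages were short.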
