This article gives an overview of database consistency models and introduces the PACELC Theorem.
A primer on memory and resource leaks, including some lesser-known causes.
How can you troubleshoot a broken pod when it’s built FROM scratch and you can’t even run a shell in it?
Full disclosure: Honeycomb is my employer.
This article explains why reliability isn’t a one-off project that you can bolt on and then move on from.
Gavin Cahill — Gremlin
DoorDash wanted consistent observability across their infrastructure that didn’t depend on instrumenting each application. To get it, they developed BPFAgent, and this article explains how.
Patrick Rogers — DoorDash
Mean time to innocence is the average elapsed time between when a system problem is detected and any given team’s ability to say the team or part of its system is not the root cause of the problem.
This article, of course, is about not having a culture like that.
John Burke — TechTarget
It was the DB — more specifically, it was a DB migration with unintended locking.
Casey Huang — Pulumi
The incident stemmed from a control plane change that worked in some regions but caused OOMs in others.