Three days ago, PagerDuty had a major incident that severely impacted incident creation, notifications, and more. Linked above is a discussion on Reddit’s r/sre with lots of takes on how folks deal with this kind of thing.
u/Secret-Menu-2121 and others
It’s not telepathy; it’s about building common ground. This article explains what that means and the components that comprise common ground in an incident.
Stuart Rimell — Uptime Labs
An introduction to database connection pooling in general, and RDS Proxy in particular, complete with a Terraform snippet.
David Kraytsberg — Klaviyo
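
For readers new to the idea, here is a minimal sketch of what connection pooling means in general. It is not taken from the linked article and has nothing to do with how RDS Proxy is implemented: `ConnectionPool`, `make_conn`, and the in-memory SQLite database are all hypothetical stand-ins chosen so the example runs anywhere.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool (illustrative only): connections are opened once and reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks when every connection is checked out, which caps concurrent
        # connections to the database at `size`.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back for reuse instead of closing it.
            self._idle.put(conn)


opened = []  # tracks how many real connections were ever created


def make_conn():
    # Hypothetical factory; a real service would connect to its database here.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    opened.append(conn)
    return conn


pool = ConnectionPool(make_conn, size=2)

results = []
for _ in range(10):
    with pool.connection() as conn:
        results.append(conn.execute("SELECT 1").fetchone()[0])
```

Ten queries run over only two underlying connections. A managed proxy moves the same idea out of the application and in front of the database.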
This article explores the difference between simple and easy, their relation to complexity, and the effect of production pressure.
Lorin Hochstein
What does “High Availability” actually mean? It turns out that it can mean different things to different people, and it’s important to look deeper.
Teiva Harsanyi — The Coder Cafe
This short but sweet untitled LinkedIn post goes into the importance of understanding the entire context rather than focusing on an individual’s mistakes or omissions.
Ron Gantt
Whether you’re just getting started implementing SLIs and SLOs or you’re a veteran, you’ll want to read this one. It charts the progress of organizations as they successively refine and mature their SLIs, and more importantly, it explains why the later stages matter.
Alex Ewerlöf
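
If you are at the "just getting started" end of that range, a request-based availability SLI and its error budget can be sketched in a few lines. This is a generic illustration, not something from the linked article; the function names and the sample numbers are assumptions made up for the example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good (a simple request-based SLI)."""
    if total_events == 0:
        raise ValueError("no events in the window")
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int, slo: float) -> float:
    """Share of the error budget still unspent; negative means the SLO is blown."""
    allowed_bad = total_events * (1.0 - slo)  # bad events the SLO tolerates
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Made-up numbers: 1,000,000 requests, 500 failures, a 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(999_500, 1_000_000, slo=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

With these sample numbers, the SLO tolerates about 1,000 bad requests and 500 have occurred, so roughly half the budget is left.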