SRE Weekly Issue #492

A message from our sponsor, Observe, Inc.:

Built on a scalable, cost-efficient data lake, Observe delivers AI-powered observability at scale. With its context-aware Knowledge Graph and AI SRE, Observe enables Capital One, Topgolf, and Dialpad to ingest hundreds of terabytes daily and resolve issues faster—at drastically lower cost.

Learn how Observe is redefining observability for the AI era.

Three days ago, PagerDuty had a major incident, severely impacting incident creation, notifications, and more. Linked above is a discussion on Reddit’s r/sre with lots of takes on how folks deal with this kind of thing.

  u/Secret-Menu-2121 and others

It’s not telepathy; it’s about building common ground. This article explains what that means and the components that comprise common ground in an incident.

  Stuart Rimell — Uptime Labs

An introduction to database connection pooling in general and RDS Proxy in particular, complete with a Terraform snippet.

  David Kraytsberg — Klaviyo

This article explores the difference between simple and easy, their relation to complexity, and the effect of production pressure.

  Lorin Hochstein

What does “High Availability” actually mean? It turns out that it can mean different things to different people, and it’s important to look deeper.

  Teiva Harsanyi — The Coder Cafe

This short but sweet untitled LinkedIn post goes into the importance of understanding the entire context rather than focusing on an individual’s mistakes or omissions.

  Ron Gantt

Whether you’re just getting started implementing SLIs and SLOs or you’re a veteran, you’ll want to read this one. It charts the progress of organizations as they successively refine and mature their SLIs, and more importantly, it explains why the later stages matter.

  Alex Ewerlöf

Updated: August 31, 2025 — 9:47 pm
A production of Tinker Tinker Tinker, LLC