SRE Weekly Issue #492

r/sre: pagerduty went down and my day went straight to hell

Three days ago, PagerDuty had a major incident that severely impacted incident creation, notifications, and more. Linked above is a discussion on Reddit’s r/sre with lots of takes on how folks deal with this kind of thing.

u/Secret-Menu-2121 and others

Being on the Same Page During an Incident: Not Actually Telepathy

It’s not telepathy; it’s about building common ground. This article explains what that means and describes the components that make up common ground in an incident.

Stuart Rimell — Uptime Labs

Pooling Connections with RDS Proxy at Klaviyo

An introduction to database connection pooling in general, and RDS Proxy in particular, complete with a Terraform snippet.

David Kraytsberg — Klaviyo

Easy will always trump simple

This article explores the difference between simple and easy, their relation to complexity, and the effect of production pressure.

Lorin Hochstein

Availability Models: Because “Highly Available” Isn’t Saying Much

What does “High Availability” actually mean? It turns out that it can mean different things to different people, and it’s important to look deeper.

Teiva Harsanyi — The Coder Cafe

Ron Gantt on incidents and blame

This short but sweet untitled LinkedIn post goes into the importance of understanding the entire context rather than focusing on an individual’s mistakes or omissions.

Ron Gantt

SLI Evolution Stages

Whether you’re just getting started implementing SLIs and SLOs or you’re a veteran, you’ll want to read this one. It charts the progress of organizations as they successively refine and mature their SLIs, and more importantly, it explains why the later stages matter.

Alex Ewerlöf
