SRE Weekly Issue #481

A message from our sponsor, PagerDuty:

Need Slack-native E2E incident management? PagerDuty delivers! Automatic incident workflows that set up Slack channels? ✅ Incident roles and built-in commands? ✅ AI-powered chat that provides real-time customer impact? ✅ Now available on ALL paid PagerDuty plans.

https://fnf.dev/4dZ5V36

On Thursday, GCP had a major incident, returning 500 errors for many services worldwide. Click through for Google’s incident report.

  Google

Cloudflare’s KV service has a dependency on GCP, and Cloudflare posted this report on their incident.

  Jeremy Hartman and CJ Desai — Cloudflare

Lorin Hochstein’s perspective on an incident report often makes me see things I didn’t in my first pass.

  Lorin Hochstein

Should you escalate early or avoid pulling folks in unless absolutely necessary? This article goes into these questions and beyond, delving into the definition and purpose of escalation.

  Hamed Silatani — Uptime Labs

How do we ensure the reliability of an LLM-based system? Can we apply traditional SRE principles and techniques to AI? This article gave me a lot to think about.

  Denys Vasyliev — The New Stack

In this blog post, we’ll discuss our experiences in identifying the challenges associated with EC2 network throttling. We’ll also delve into how we developed network performance monitoring for the Pinterest EC2 fleet and discuss various techniques we implemented to manage network bursts, ensuring dependable network performance for our critical online serving workloads.

  Jia Zhan and Sachin Holla — Pinterest

High Availability keeps things stable in small failures. DR is the safety net for large-scale disasters.

After explaining why HA by itself isn’t enough, this article covers strategies, costs, and best practices for disaster recovery.

  Yakaiah Bommishetti — HackerNoon

This article explains how observability costs can ramp up quickly, especially if we’re not careful about what data we store.

There’s a lot of nuance here, and the author posted this follow-up the next day after receiving many responses.

  Leon Adato

Updated: June 15, 2025 — 9:22 pm
A production of Tinker Tinker Tinker, LLC