SRE Weekly Issue #481

Google Cloud Platform Incident, June 12, 2025

On Thursday, GCP had a major incident, returning 500 errors for many services worldwide. Click through for Google’s incident report.

Google

Cloudflare service outage June 12, 2025

Cloudflare’s KV service has a dependency on GCP, and Cloudflare posted this report on their incident.

Jeremy Hartman and CJ Desai — Cloudflare

Quick takes on the GCP public incident write-up

Lorin Hochstein’s perspective on an incident report often makes me see things I didn’t in my first pass.

Lorin Hochstein

Too Soon or Too Late: The Incident Escalation Dilemma

Should you escalate early or avoid pulling folks in unless absolutely necessary? This article goes into these questions and beyond, delving into the definition and purpose of escalation.

Hamed Silatani — Uptime Labs

AI Reliability Engineering: Welcome to the Third Age of SRE

How do we ensure the reliability of an LLM-based system? Can we apply traditional SRE principles and techniques to AI? This article gave me a lot to think about.

Denys Vasyliev — The New Stack

Handling Network Throttling with AWS EC2 at Pinterest

In this blog post, we’ll discuss our experiences in identifying the challenges associated with EC2 network throttling. We’ll also delve into how we developed network performance monitoring for the Pinterest EC2 fleet and discuss various techniques we implemented to manage network bursts, ensuring dependable network performance for our critical online serving workloads.

Jia Zhan and Sachin Holla — Pinterest

Beyond High Availability: Disaster Recovery Architectures That Keep Running When HA Fails

High Availability keeps things stable in small failures. DR is the safety net for large-scale disasters.

After explaining why HA by itself isn’t enough, this article covers strategies, costs, and best practices for disaster recovery.

Yakaiah Bommishetti — HackerNoon

Who the Hell is Going to Pay For This?

This article explains how observability costs can ramp up quickly, especially if we’re not careful about what data we store.

There’s a lot of nuance here, and the author posted a follow-up the next day after receiving many responses.

Leon Adato
