On Thursday, GCP had a major incident, returning 500 errors for many services worldwide. Click through for Google’s incident report.
Cloudflare’s KV service has a dependency on GCP, and Cloudflare posted this report on their incident.
Jeremy Hartman and CJ Desai — Cloudflare
Lorin Hochstein’s perspective on an incident report often makes me see things I missed on my first pass.
Lorin Hochstein
Should you escalate early or avoid pulling folks in unless absolutely necessary? This article goes into these questions and beyond, delving into the definition and purpose of escalation.
Hamed Silatani — Uptime Labs
How do we ensure the reliability of an LLM-based system? Can we apply traditional SRE principles and techniques to AI? This article gave me a lot to think about.
Denys Vasyliev — The New Stack
In this blog post, we’ll discuss our experiences in identifying the challenges associated with EC2 network throttling. We’ll also delve into how we developed network performance monitoring for the Pinterest EC2 fleet and discuss various techniques we implemented to manage network bursts, ensuring dependable network performance for our critical online serving workloads.
Jia Zhan and Sachin Holla — Pinterest
High Availability keeps things stable during small failures. DR is the safety net for large-scale disasters.
After explaining why HA by itself isn’t enough, this article covers strategies, costs, and best practices for disaster recovery.
Yakaiah Bommishetti — HackerNoon
This article explains how observability costs can ramp up quickly, especially if we’re not careful about what data we store.
There’s a lot of nuance here, and the author posted this follow-up the next day after receiving many responses.
Leon Adato