SRE Weekly Issue #480

You can’t prevent your last outage, no matter how hard you try

the idea that the highest ROI for risk reduction work is in the highest severity incidents is not a fact, it’s a hypothesis that simply isn’t supported by data.

Lorin Hochstein

Is Fewer Incidents Always Good?

Incidents are bad, so should we try to have fewer of them? This article challenges the assumptions contained within that goal and suggests other ways to frame one’s thinking.

Hamed Silatani — Uptime Labs

Understanding and optimizing resource consumption in Prometheus

This guide goes deeply into the details of how Prometheus uses memory, and then it shows you how to get a handle on it.

Vladimir Guryanov — Palark

From DNS Failures to Resilience: How NodeLocal DNSCache Saved the Day

This article discusses the DNS-related challenges encountered at Mercari on our Kubernetes clusters and the significant improvements achieved by implementing Node-Local DNS Cache.

Satyadarshi Sanu — Mercari

Paxos vs. Raft and Modern Implementations

In this post we’ll explore the fundamentals of distributed consensus, compare the dominant consensus algorithms Paxos and Raft, and examine recent implementations like Kafka Raft.

Narendra Reddy Sanikommu — DEV

Improving platform resilience at Cash App

A discussion of two techniques the folks at Cash App used to improve their reliability: adopting a two-cluster topology with Kubernetes, and using Amazon’s Fault Injection Service to simulate the failure of an availability zone.

Dustin Ellis, Deepak Garg, Ben Apprederisse, Jan Zantinge, and Rachel Sheikh — Amazon

Identifying Cacheable Queries: Using tools like pt-query-digest or the MySQL sys schema to pinpoint queries that would benefit from caching

Reading this one taught me a couple of techniques I wasn’t aware of for finding queries in need of optimization in MySQL.

Vinicius Grippa — Readyset

Incident Report: June 6th, 2025

Ouch — and a great learning opportunity for all of us:

When our backend circuit breakers triggered, aggressive websocket reconnect logic initiated on every connected client at once, further overwhelming an already stressed database.

Jake Cooper — Railway
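
The failure mode in that quote, every client reconnecting at the same moment, is commonly softened with exponential backoff plus random jitter. The following is a minimal sketch of that general technique, not Railway's actual fix; the function name `reconnect_delay` and the `base` and `cap` values are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized wait (in seconds) before reconnect attempt `attempt`.

    `attempt` is 0 for the first retry. `base` and `cap` are illustrative
    defaults, not values taken from the incident report.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 30)
    # The ceiling doubles with each attempt, up to `cap`.
    ceiling = min(cap, base * (2 ** exponent))
    # "Full jitter": pick uniformly in [0, ceiling] so clients spread out
    # instead of all retrying at the same instant.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Because each client draws its own random delay, reconnects after a shared failure are spread across the backoff window rather than arriving as a single spike.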

