SRE Weekly Issue #480

A message from our sponsor, PagerDuty:

🔍 Notable PagerDuty shift: Full incident management now spans all paid tiers. The upgraded Slack-first and Teams-first experience means fewer tools to juggle during incidents. Only leveraging PagerDuty for basic alerting? Time to check out what’s newly available in your plan!

https://fnf.dev/4dZ5V36

the idea that the highest ROI for risk reduction work is in the highest severity incidents is not a fact, it’s a hypothesis that simply isn’t supported by data.

  Lorin Hochstein

Incidents are bad, so should we try to have fewer of them? This article challenges the assumptions contained within that goal and suggests other ways to frame one’s thinking.

  Hamed Silatani — Uptime Labs

This guide goes deep into the details of how Prometheus uses memory, then shows you how to get a handle on it.

  Vladimir Guryanov — Palark

This article discusses the DNS-related challenges encountered at Mercari on our Kubernetes clusters and the significant improvements achieved by implementing Node-Local DNS Cache.

  Satyadarshi Sanu — Mercari

In this post we’ll explore the fundamentals of distributed consensus, compare the dominant consensus algorithms Paxos and Raft, and examine recent implementations like Kafka Raft.

  Narendra Reddy Sanikommu — DEV

A discussion of two techniques the folks at Cash App used to improve their reliability: adopting a two-cluster topology with Kubernetes, and using Amazon’s Fault Injection Service to simulate the failure of an availability zone.

  Dustin Ellis, Deepak Garg, Ben Apprederisse, Jan Zantinge, and Rachel Sheikh — Amazon

Reading this one taught me a couple of techniques I wasn’t aware of for finding queries in need of optimization in MySQL.

  Vinicius Grippa — Readyset

Ouch — and a great learning opportunity for all of us:

When our backend circuit breakers triggered, aggressive websocket reconnect logic initiated on every connected client at once, further overwhelming an already stressed database.

  Jake Cooper — Railway
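
For anyone who hasn't run into a reconnect stampede before, one common mitigation is exponential backoff with jitter, so that clients spread their retries out instead of all reconnecting at the same instant. Here is a minimal sketch of the idea in Python. To be clear, this is not Railway's actual fix, and the function name `reconnect_delay` and its `base` and `cap` defaults are made up for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    Hypothetical helper: the window doubles with each failed attempt,
    is capped at `cap` seconds, and the actual delay is drawn uniformly
    from [0, window] so clients do not retry in lockstep.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    window = min(cap, base * (2 ** exponent))
    return random.uniform(0.0, window)


if __name__ == "__main__":
    # Show one randomly drawn delay for each of the first few attempts.
    for attempt in range(6):
        print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

A client would sleep for `reconnect_delay(n)` seconds before its nth reconnect attempt and reset `n` to zero after a successful connection. Drawing from the whole window, rather than adding a small random offset to a fixed delay, is what breaks up the synchronized wave.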

Updated: June 8, 2025 — 10:25 pm
A production of Tinker Tinker Tinker, LLC