the idea that the highest ROI for risk reduction work is in the highest severity incidents is not a fact, it’s a hypothesis that simply isn’t supported by data.
Lorin Hochstein
Incidents are bad, so should we try to have fewer of them? This article challenges the assumptions contained within that goal and suggests other ways to frame one’s thinking.
Hamed Silatani — Uptime Labs
This guide goes deep into how Prometheus uses memory, then shows you how to get a handle on it.
Vladimir Guryanov — Palark
This article discusses the DNS-related challenges encountered at Mercari on our Kubernetes clusters and the significant improvements achieved by implementing Node-Local DNS Cache.
Satyadarshi Sanu — Mercari
In this post we’ll explore the fundamentals of distributed consensus, compare the dominant consensus algorithms Paxos and Raft, and examine recent implementations like Kafka Raft.
Narendra Reddy Sanikommu — DEV
A discussion of two techniques the folks at Cash App used to improve their reliability: adopting a two-cluster topology with Kubernetes, and using Amazon’s Fault Injection Service to simulate the failure of an availability zone.
Dustin Ellis, Deepak Garg, Ben Apprederisse, Jan Zantinge, and Rachel Sheikh — Amazon
Reading this one taught me a couple of techniques I wasn’t aware of for finding queries in need of optimization in MySQL.
Vinicius Grippa — Readyset
Ouch — and a great learning opportunity for all of us:
When our backend circuit breakers triggered, aggressive websocket reconnect logic initiated on every connected client at once, further overwhelming an already stressed database.
Jake Cooper — Railway
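The failure mode in that quote, every client reconnecting at the same moment, is often described as a reconnect storm. One widely used mitigation is exponential backoff with random jitter. The sketch below is only an illustration of that general technique, not Railway's actual fix; the function name `reconnect_delay` and its `base` and `cap` defaults are assumptions made up for this example.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized delay (seconds) before reconnect attempt `attempt` (0-based).

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Exponential growth, capped so the ceiling never exceeds `cap`.
    # The exponent is clamped to avoid overflow on very large attempt counts.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    # "Full jitter": pick uniformly in [0, ceiling] so that clients which
    # disconnected at the same instant spread their retries over time.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Show the delays one client might wait across successive attempts.
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Because each client draws its own random delay, retries arrive spread across the window rather than in a single spike, which gives a stressed backend room to recover.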