Gremlin Inc. helps folks simulate failure, but what happens when they turn their tools on their own infrastructure? In this article, they share all sorts of juicy details about how they set up their experiments, what they hoped to prove and thought might happen, and then what actually happened, including an unexpected failure mode.
This article series isn’t actually about writing your own new distributed log from scratch — probably not a good idea. It’s about learning the fundamental principles involved in designing such systems so that we can better understand them while operating and using them.
What do you do about the scary system that nobody touches and everyone is afraid will fall over some day? This article shows you a concrete plan for digging in and dealing with the skeleton in the closet.
It’s Julia Evans, writing at Stripe!
In this post, we’ll explain why we chose to build on top of Kubernetes. We’ll examine how we integrated Kubernetes into our existing infrastructure, our approach to building confidence in (and improving) our Kubernetes cluster’s reliability, and the abstractions we’ve built on top of Kubernetes.
AppOptics’s take on alerting, including this gem:
More often, our metric choices and threshold values are guided by our preexisting tools. Hence, if our tools cannot measure latency, we do not alert on latency.
How many times have you seen a migration or transition reach 90% completion and stall? This SysAdvent author urges caution in engaging a “hybrid cloud” vendor solution.
Juniper discusses the evolution of the Network Engineer role into Network Reliability Engineer (NRE).
Just like sysadmins have graduated from technicians to technologists as SREs, the NRE title is a declaration of a new culture and serves as the zenith for all that we do and have as engineers of network invincibility.
A primer on setting up load testing for WebDAV using Apache JMeter.
An interesting debugging story involving a tricky data corruption bug in RavenDB.