SRE Weekly Issue #103

Articles

Gremlin Inc. helps folks simulate failure, but what happens when they turn their tools on their own infrastructure? In this article, they share all sorts of juicy details about how they set up their experiments, what they hoped to prove and thought might happen, and then what actually happened, including an unexpected failure mode.

Building a Distributed Log from Scratch, Part 1: Storage Mechanics

This article series isn’t actually about writing your own new distributed log from scratch — probably not a good idea. It’s about learning the fundamental principles involved in designing such systems so that we can better understand them while operating and using them.

sysadvent: Day 21 – Lighting Up Your Haunted Graveyards

What do you do about the scary system that nobody touches and everyone is afraid will fall over some day? This article shows you a concrete plan for digging in and dealing with the skeleton in the closet.

Learning to operate Kubernetes reliably

It’s Julia Evans, writing at Stripe!

In this post, we’ll explain why we chose to build on top of Kubernetes. We’ll examine how we integrated Kubernetes into our existing infrastructure, our approach to building confidence in (and improving) our Kubernetes cluster’s reliability, and the abstractions we’ve built on top of Kubernetes.

No Need to Be Alarmed: Crafting an Effective Alert Strategy

AppOptics’s take on alerting, including this gem:

More often, our metric choices and threshold values are guided by our preexisting tools. Hence, if our tools cannot measure latency, we do not alert on latency.

sysadvent: Day 17 – Don’t Fall for the Hybrid Cloud Trap

How many times have you seen a migration or transition reach 90% completion and stall? This SysAdvent author urges caution in engaging a “hybrid cloud” vendor solution.

2018 and the Dawn of Network Reliability Engineering (NRE)

Juniper discusses the evolution of the Network Engineer role into Network Reliability Engineer (NRE).

Just like sysadmins have graduated from technicians to technologists as SREs, the NRE title is a declaration of a new culture and serves as the zenith for all that we do and have as engineers of network invincibility.

Load Testing WebDAV Servers

A primer on setting up load testing for WebDAV using Apache JMeter.

Production postmortem: data corruption, a view from INSIDE the sausage

An interesting debugging story involving a tricky data corruption bug in RavenDB.
