SRE Weekly Issue #129

SPONSOR MESSAGE

Aggregate monitoring techniques alongside time series data can improve overall system visibility and reliability. Take SRE to the next level with these aggregate monitoring methods:

http://try.victorops.com/SREWeekly/Aggregate-Monitoring

Articles

What do you do when your hosts have kernel crashes at random every day? It turns out that you don’t need to be a seasoned kernel programmer to find a solution.

Pavlos Parissis — Booking.com

This is my first introduction to tcpconnect (part of BCC). Pretty nifty!

fREW Schmidt

At Facebook, […] It is simply too difficult to rewrite caching/admission/eviction policies and other manually tuned heuristics by hand. We have to fundamentally change how we think about software maintenance.

Vladimir Bychkovsky, Jim Cipar, Alvin Wen, Lili Hu, and Saurav Mohapatra — Facebook

A couple weeks back, I linked to a postmortem template. Here’s a gameday report template from the same author.

Michael Kehoe

I had a really hard time choosing whether to include this one. On the one hand, it’s a really interesting article about service discovery in franchises that has to work right every time. On the other hand, Chick-fil-A has a terrible track record on GLBT rights, and I can’t overlook that.

Ultimately, I’m choosing to link to this article for its educational content, but I urge you to join me as I continue to boycott Chick-fil-A.

Brian Chambers, Caleb Hurd, and Alex Crane — Chick-fil-A

At 9 years old, this may be the oldest article I’ve linked to, but it’s worth it. The analogy to a home mortgage is spot on.

Eric Lee

Click through to read about an interesting monitoring challenge and an account of how they solved it. I appreciate the emphasis on the importance of educating engineers to spread the knowledge of how the new system works among more people.

Joy Zheng and Jeeyoung Kim — Plaid

Another chaos engineering introduction. Why should you read it? If nothing else, the architecture diagram with the skull and cobwebs on it is pretty great. It’s also well worth reading if you’re looking to create a chaos engineering game plan.

Benjamin Wilms — Codecentric

Sometimes, a reliability risk can come in the form of a bunch of angry customers.

Ben Kuchera — Ars Technica

Outages

Updated: July 8, 2018 — 9:29 pm
A production of Tinker Tinker Tinker, LLC