SRE Weekly Issue #117

SPONSOR MESSAGE

“If it ain’t broke—let’s break it, fix it, then break it again, then fix it again.” Read more about making your SRE team(s) more proactive through chaos engineering: http://try.victorops.com/proactive-sre

Articles

Brilliant, just brilliant. This isn’t just another “there isn’t just one root cause” article to skip over. The author takes time to explain the concept with cogent examples and useful metaphors. This one really caught my eye:

What’s the root cause of success?
[…] When building a successful project, there’s never just one thing that goes right for it to succeed.

Will Gallego

This episode of Food Fight is an hour-long interview with guests Rob Schnepp, Ron Vidal, and Chris Hawley, the three firefighters behind Blackrock 3 Partners. It’s a great intro to the Incident Management System, and well worth a listen.

Shout-out to Maple Player, an Android audio player with a really high-quality tempo increase feature. I was able to listen at 1.5x speed and still understand everything; otherwise, I wouldn’t have had time this week.

Nell Shamrell-Harrington and Nathen Harvey

Here’s one from the archives, an incident report from 2013. After a temporary network partition in a Redis cluster, the replicas all tried to resynchronize at once, overloading the master. One result was that some customers were charged repeatedly for the same thing.

Twilio

You have to design a system such that the natural thing to do yields a good result and doesn’t put anyone in harm’s way.

Rachel Kroll

I thought consistent hashing was largely solved. I was wrong! There are some good solutions out there, but you have to evaluate their relative trade-offs and pick the right one for your use case.

Damian Gryski

Full disclosure: Damian Gryski is my coworker at Fastly.
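
For anyone who hasn't run into consistent hashing before, here's a minimal sketch of the classic ring-based approach with virtual nodes, in Python. It's only an illustration of the general idea, not code from Damian's article: the `HashRing` class, the `vnodes` parameter, and the use of MD5 for spreading keys are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer. MD5 is used for spread, not security."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Ring-based consistent hash with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes   # assumed default; more vnodes gives a smoother spread
        self._points = []       # sorted hash positions on the ring
        self._owner = {}        # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Place several points per node so load is spread around the ring.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        # Drop only this node's points; every other point stays where it was.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # Walk clockwise to the first point past the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.lookup("user:42"))
```

The property that matters is that removing a node only moves the keys that node owned; everything else stays put. The costs of this particular design are the memory for the virtual nodes and a load spread that is only approximately even, which is exactly the kind of trade-off you'd want to weigh against the alternatives.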

As you read this article, consider the ethical imperative of system reliability: in some cases, reliability can literally mean life or death. That’s only going to become more common in the coming years.

Yonatan Zunger

Our service needs to be available 24/7, without question. In order to ensure this happens, the LogicMonitor TechOps team uses HashiCorp Packer, Terraform, and Consul to dynamically build infrastructure for disaster recovery (DR) in a reliable and sustainable way.

Randall Thomson — LogicMonitor

On Tuesday, 13 March 2018 at 12:04 UTC a database query was accidentally run against our production database which truncated all tables.

Oof. Sorry, Travis folks, but a sincere thanks for sharing your experience with us.

Konstantin Haase — Travis CI

I like these “preliminary results” better than the kinds of aggregate statistics you normally get from a survey report. There are real quotes from free-form survey answers, including a couple of real gems. There’s a link to download the actual survey report if you’re into that, too.

Dawn Parzych — Catchpoint

Outages

Updated: April 8, 2018 — 8:41 pm
A production of Tinker Tinker Tinker, LLC