SRE Weekly Issue #114

SPONSOR MESSAGE

Why is design so important to data-driven teams, and what does it mean for observability? See what several experts have to say. http://try.victorops.com/SREWeekly/Observability

Articles

The FCC has released a report on the major Level 3 outage in October of 2016. This summary article serves as a good TL;DR summary on what went wrong and includes a link to the full report.

Brian Santo — Fierce Telecom

They had an awesome approach: use RSpec to create a test suite of HTTP requests and run it continuously during the deployment to ensure that nothing changed from the end-user’s perspective. Bonus points for generating tests automatically.

Jacob Bednarz — Envato

Netflix reduced the time it takes to evacuate a failed AWS region from 50 minutes to just 8.

Luke Kosewski, Amjith Ramanujam, Niosha Behnam, Aaron Blohowiak, and Katharina Probst — Netflix

I don’t usually link to talks, but this talk transcript reads almost like an article, and it’s a good one. The premise: if you’re not monitoring well, then you can’t safely test in production. Scalyr found a few ways in which their monitoring showed cracks, and now they’re sharing it with us.

Steven Czerwinski — Scalyr

Design carefully, especially around retries, lest you create a thundering herd that makes it much harder to recover from an outage. That lesson and more, in this article on shooting yourself in the foot at web scale.

Benjamin Campbell — Business Computing World

Have I mentioned how much I love GitLab’s openness? Here’s how they handle on-call shift transitions in their remote-only organization.

John Jarvis — GitLab

What is the definition of a distributed system, and why are they difficult? I really love the definition in the second tweet.

Charity Majors

I sure love a good troubleshooting story. This one has a pretty excellent failure mode, A+ investigative technique, and an emphasis on following something through until you find an answer.

Rachel Kroll

This discussion of how and why to create a globally-distributed SRE team may only apply to bigger companies, but it’s got a lot of useful bits in it. I just have to stop laughing at the acronym “GD”…

Akhil Ahuja — LinkedIn

Outages

Updated: March 18, 2018 — 8:30 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme