The FCC has released a report on the major Level 3 outage of October 2016. This article serves as a good TL;DR of what went wrong and includes a link to the full report.
Brian Santo — Fierce Telecom
Envato took an awesome approach: use RSpec to create a test suite of HTTP requests, then run it continuously during a deployment to ensure that nothing changes from the end user’s perspective. Bonus points for generating the tests automatically. There’s a rough sketch of the idea below.
Jacob Bednarz — Envato
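To make the idea concrete, here’s a minimal sketch of that kind of end-user smoke test in RSpec. This isn’t Envato’s actual suite: the endpoint and the expected body text are hypothetical placeholders.

```ruby
# smoke_spec.rb: a minimal end-user smoke test. The URL and the expected
# body content below are hypothetical placeholders for illustration.
require "net/http"
require "uri"

RSpec.describe "end-user smoke test" do
  let(:uri) { URI("https://www.example.com/") } # hypothetical endpoint

  it "returns 200 with the expected content" do
    response = Net::HTTP.get_response(uri)
    expect(response.code).to eq("200")
    expect(response.body).to include("Example Domain") # sentinel string
  end
end
```

Running it continuously during a deploy can be as simple as a shell loop, e.g. `while rspec smoke_spec.rb; do sleep 5; done`, which halts the moment a request stops looking right to users.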
Netflix reduced the time it takes to evacuate a failed AWS region from 50 minutes to just 8.
Luke Kosewski, Amjith Ramanujam, Niosha Behnam, Aaron Blohowiak, and Katharina Probst — Netflix
I don’t usually link to talks, but this talk transcript reads almost like an article, and it’s a good one. The premise: if you’re not monitoring well, then you can’t safely test in production. Scalyr found a few ways in which their monitoring showed cracks, and now they’re sharing it with us.
Steven Czerwinski — Scalyr
Design carefully, especially around retries, lest you create a thundering herd that makes it much harder to recover from an outage. That lesson and more, in this article on shooting yourself in the foot at web scale. A sketch of the classic mitigation is below.
Benjamin Campbell — Business Computing World
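The standard antidote to retry-driven thundering herds is capped exponential backoff with jitter, so clients retrying after an outage don’t stampede the recovering service in lockstep. A minimal sketch, with a helper name and limits that are my own rather than the article’s:

```ruby
# Retry a block with capped exponential backoff and full jitter.
# The method name and defaults are illustrative, not from the article.
require "net/http"
require "uri"

def with_backoff(max_attempts: 5, base: 0.5, cap: 30.0)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue StandardError
    raise if attempt >= max_attempts
    # Full jitter: sleep a random duration up to the exponential ceiling,
    # so a fleet of clients doesn't retry in synchronized waves.
    sleep(rand * [cap, base * 2**attempt].min)
    retry
  end
end

with_backoff { Net::HTTP.get_response(URI("https://www.example.com/")) }
```

The random jitter spreads retry load over time, the cap keeps any one client from backing off forever, and the attempt limit turns a persistent failure into an error the caller can handle.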
Have I mentioned how much I love GitLab’s openness? Here’s how they handle on-call shift transitions in their remote-only organization.
John Jarvis — GitLab
What is the definition of a distributed system, and why are these systems so difficult? I really love the definition in the second tweet.
I sure love a good troubleshooting story. This one has a pretty excellent failure mode, an A+ investigative technique, and an emphasis on following the problem through until you find an answer.
This discussion of how and why to create a globally-distributed SRE team may only apply to bigger companies, but it’s got a lot of useful bits in it. I just have to stop laughing at the acronym “GD”…
Akhil Ahuja — LinkedIn