SRE Weekly Issue #114

Articles

Level 3 technician’s misstep causes largest outage ever reported

The FCC has released a report on the major Level 3 outage in October of 2016. This article serves as a good TL;DR of what went wrong and includes a link to the full report.

Brian Santo — Fierce Telecom

Migrating edge network providers

They had an awesome approach: use RSpec to create a test suite of HTTP requests and run it continuously during the deployment to ensure that nothing changed from the end user's perspective. Bonus points for generating tests automatically.

Jacob Bednarz — Envato
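
For a rough feel of the idea, here is a minimal sketch in Python rather than RSpec. It is not Envato's code: the `Observed` type, the `snapshot` and `diff_snapshots` helpers, and the list of checked headers are all assumptions made up for illustration. The fetch function is passed in, so a real HTTP client can be swapped for a fake one in tests.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class Observed(NamedTuple):
    """What an end user would see for one request (hypothetical type)."""
    status: int
    headers: Dict[str, str]


# Assumption: only these headers matter to end users. Provider-specific
# headers (e.g. cache node IDs) are expected to change and are ignored.
CHECKED_HEADERS = ("content-type", "cache-control", "location")


def snapshot(fetch: Callable[[str], Observed],
             paths: Iterable[str]) -> Dict[str, Observed]:
    """Fetch each path and keep only the user-visible parts of the response."""
    result = {}
    for path in paths:
        observed = fetch(path)
        lowered = {k.lower(): v for k, v in observed.headers.items()}
        kept = {h: lowered[h] for h in CHECKED_HEADERS if h in lowered}
        result[path] = Observed(observed.status, kept)
    return result


def diff_snapshots(baseline: Dict[str, Observed],
                   current: Dict[str, Observed]) -> List[str]:
    """Return a human-readable list of differences (empty means no change)."""
    problems = []
    for path, expected in baseline.items():
        actual = current.get(path)
        if actual is None:
            problems.append(f"{path}: missing from current snapshot")
            continue
        if actual.status != expected.status:
            problems.append(
                f"{path}: status {expected.status} -> {actual.status}")
        for header in CHECKED_HEADERS:
            before = expected.headers.get(header)
            after = actual.headers.get(header)
            if before != after:
                problems.append(
                    f"{path}: header {header} {before!r} -> {after!r}")
    return problems
```

Taking the baseline snapshot before the migration and re-running the diff on a loop during the cutover gives the same "nothing changed for the user" signal, under the assumptions above.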

Project Nimble: Region Evacuation Reimagined

Netflix reduced the time it takes to evacuate a failed AWS region from 50 minutes to just 8.

Luke Kosewski, Amjith Ramanujam, Niosha Behnam, Aaron Blohowiak, and Katharina Probst — Netflix

Tonight We Monitor, For Tomorrow, We Test in Production!

I don’t usually link to talks, but this talk transcript reads almost like an article, and it’s a good one. The premise: if you’re not monitoring well, then you can’t safely test in production. Scalyr found a few ways in which their monitoring showed cracks, and now they’re sharing them with us.

Steven Czerwinski — Scalyr

Oopsy DDoSy: Accidental DDoS Attacks Causing Major Grief

Design carefully, especially around retries, lest you create a thundering herd that makes it much harder to recover from an outage. That lesson and more, in this article on shooting yourself in the foot at web scale.

Benjamin Campbell — Business Computing World
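
One common way to keep retries from turning into a thundering herd is capped exponential backoff with random jitter. The sketch below is not taken from the article; `backoff_delay`, `call_with_retries`, and their default values are hypothetical names and numbers chosen for illustration. The sleep and random functions are injectable so the behavior can be checked without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay before the next retry: a random value in [0, ceiling).

    The ceiling doubles with each attempt and is capped, so clients that
    failed at the same moment spread their retries out over time instead
    of all hitting the recovering service at once.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation: Callable[[], T], max_attempts: int = 5,
                      base: float = 0.5, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Run operation, retrying on any exception up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Give up after the final attempt rather than retrying forever.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
    raise AssertionError("unreachable")
```

The bounded attempt count matters as much as the jitter: a client that retries forever keeps adding load for as long as the outage lasts.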

How our production team runs the weekly on-call handover

Have I mentioned how much I love GitLab’s openness? Here’s how they handle on-call shift transitions in their remote-only organization.

John Jarvis — GitLab

Twitter: Charity Majors on distributed systems, complexity, and microservices

What is the definition of a distributed system, and why are they difficult? I really love the definition in the second tweet.

Charity Majors

Troubleshooting IPv6 badness to certain hosts in a rack

I sure love a good troubleshooting story. This one has a pretty excellent failure mode, A+ investigative technique, and an emphasis on following something through until you find an answer.

Rachel Kroll

The Makeup of Successful Geographically-Distributed SRE Teams: Part 1 | LinkedIn Engineering

This discussion of how and why to create a globally-distributed SRE team may only apply to bigger companies, but it’s got a lot of useful bits in it. I just have to stop laughing at the acronym “GD”…

Akhil Ahuja — LinkedIn

Outages

- DoubleClick (ad provider)
  - DoubleClick went down, and it took a lot of sites with it. Click through for Catchpoint’s excellent analysis.
    Kameerath Kareem — Catchpoint
- Travis CI
- SmartThings (IoT platform)
- Air Canada
- Netflix
