SRE Weekly Issue #146

A message from our sponsor, VictorOps:

Automation can be used to help classify incident severity and route alerts to the right person or team. Learn how SRE teams are leveraging a refined incident classification and alert routing process to improve system reliability:

http://try.victorops.com/sreweekly/classifying-incident-severity

Articles

NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.

Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.

David Mytton — StackPath

GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.

Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.

Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.

Matthew Barlocker — Blue Matador

This sounds an awful lot like those packet rate limits from the previous article…

Chris McFadden — SparkPost

Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworth’s tried a similar experiment with their stores, with interesting results.

Sidney Dekker — Safety Differently

Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it fails to obey timeouts.

Ciaran Gaffney and Fran Garcia — Hosted Graphite

when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…

Russ Miles — ChaosIQ

Outages

Updated: November 4, 2018 — 7:49 pm
A production of Tinker Tinker Tinker, LLC