NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.
Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.
David Mytton — StackPath
GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.
Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.
Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.
Matthew Barlocker — Blue Matador
This sounds an awful lot like those packet rate limits from the previous article…
Chris McFadden — SparkPost
Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworths tried a similar experiment with their stores, with interesting results.
Sidney Dekker — Safety Differently
Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it fails to obey timeouts.
Ciaran Gaffney and Fran Garcia — Hosted Graphite
when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…
Russ Miles — ChaosIQ