They used feature flags to safely transition from a single-host service to a horizontally scaled distributed system.
Ciaran Egan and Cian Synnott — Hosted Graphite
Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.
Bakha Nurzhanov — RealSelf
The challenge: you have to defend against abuse to keep your service running, but abuse detection must not degrade the experience for legitimate users.
Sahil Handa — LinkedIn
PagerDuty has developed a system for measuring on-call health, factoring in the number of pages, the time of day each one arrives, their frequency, how they cluster, and more. I love what they’re doing and I hope we see more of this in our industry.
Lisa Yang — PagerDuty
A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:
While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.
Alaina Valenzuela — Honeycomb
Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read, though, because it eloquently explains horizontal versus vertical scaling, why you’d choose one or the other, and why horizontal scaling is hard.
Sean T. Allen — Wallaroo Labs
Netflix has some truly massive cache systems, at a scale of hundreds of terabytes. Find out how they warm up new cache nodes before putting them into production.
Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix
This article lays out a promising plan for reducing the number of technologies your engineering department uses while still giving engineers the freedom to choose the right tool for the job.