SRE Weekly Issue #151

Articles

A victim of its own popularity: Scaling our CloudWatch integration

They used feature flags to safely transition from a single-host service to a horizontally-scaled distributed system.

Ciaran Egan and Cian Synnott — Hosted Graphite

Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.

Bakha Nurzhanov — RealSelf

Defending Against Abuse at LinkedIn’s Scale

The challenge: you have to defend against abuse to keep your service running, but the abuse detection also must not adversely impact the user experience.

Sahil Handa — LinkedIn

Answer to the Ultimate Question of (On-Call) Life, the Universe, and Everything: 71

PagerDuty has developed a system for measuring on-call health, factoring in quantity of pages, time of each page, frequency, clustering of pages, etc. I love what they’re doing and I hope we see more of this in our industry.

Lisa Yang — PagerDuty

Spooky Tales of Testing In Production: A Recap and Lessons Learned

A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:

While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.

Alaina Valenzuela — Honeycomb

Reasons to Scale Horizontally

Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read though, because it so eloquently explains horizontal versus vertical scaling, why you’d do one or the other, and why horizontal scaling is hard.

Sean T. Allen — Wallaroo Labs

Cache warming: Agility for a stateful service

Netflix has some truly massive cache systems at a scale of hundreds of terabytes. Find out what they do to warm up new cache nodes before putting them in production.

Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix

Software Sprawl, The Golden Path, and Scaling Teams With Agency

This article lays out a promising plan for reducing the number of technologies your engineering department is using while still giving engineers the freedom to choose the right tool for the job.

Charity Majors

Outages

Nest
GitHub
O2 (UK) and SoftBank (Japan)
- I normally don’t bother mentioning mobile phone service outages, but this one has an interesting cause: an expired TLS certificate in Ericsson’s systems.
Google Allo and Duo
Facebook
