General

SRE Weekly Issue #151

A message from our sponsor, VictorOps:

SRE teams can use synthetic monitoring and real-user monitoring to create a holistic understanding of the way their system handles stress. See how SRE teams are already implementing synthetic and real-user monitoring tools:

http://try.victorops.com/sreweekly/synthetic-and-real-user-monitoring-for-sre

Articles

They used feature flags to safely transition from a single-host service to a horizontally-scaled distributed system.

Ciaran Egan and Cian Synnott — Hosted Graphite
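
Here's a rough sketch of the kind of percentage-based flag that makes this sort of migration safe: route a small, adjustable slice of traffic to the new backend and ramp it up (or back down) without a deploy. All names and numbers below are hypothetical, not Hosted Graphite's actual code.

```python
import random

# Hypothetical flag store: the percentage of traffic routed to the new
# horizontally-scaled backend, adjustable at runtime without a deploy.
FLAGS = {"use_distributed_backend": 5}  # start with 5% of requests

def write_metric_legacy(metric, value):
    print(f"legacy single host <- {metric}={value}")

def write_metric_distributed(metric, value):
    print(f"distributed cluster <- {metric}={value}")

def write_metric(metric, value):
    # Roll the dice per request; ramp the percentage up as confidence grows,
    # or set it to 0 for an instant rollback.
    if random.uniform(0, 100) < FLAGS["use_distributed_backend"]:
        write_metric_distributed(metric, value)
    else:
        write_metric_legacy(metric, value)

if __name__ == "__main__":
    for i in range(10):
        write_metric("requests.count", i)
```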

Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.

Bakha Nurzhanov — RealSelf

The challenge: you have to defend against abuse to keep your service running, but the abuse detection also must not adversely impact the user experience.

Sahil Handa — LinkedIn

PagerDuty has developed a system for measuring on-call health, factoring in quantity of pages, time of each page, frequency, clustering of pages, etc. I love what they’re doing and I hope we see more of this in our industry.

Lisa Yang — PagerDuty
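
To make the idea concrete, here's a toy scoring function that penalizes volume, sleep-hours pages, and tightly clustered pages. The weights are invented for illustration; they are not PagerDuty's formula.

```python
from datetime import datetime, timedelta

def on_call_health(page_times):
    """Score a shift from 0 (brutal) to 100 (quiet). Weights are made up."""
    score = 100.0
    score -= 2.0 * len(page_times)            # raw volume
    for t in page_times:
        if t.hour < 8 or t.hour >= 22:        # sleep-hours pages hurt more
            score -= 5.0
    ordered = sorted(page_times)
    for a, b in zip(ordered, ordered[1:]):
        if b - a < timedelta(hours=2):        # clustered pages compound fatigue
            score -= 3.0
    return max(score, 0.0)

if __name__ == "__main__":
    pages = [
        datetime(2018, 12, 3, 2, 14),
        datetime(2018, 12, 3, 3, 5),
        datetime(2018, 12, 5, 15, 40),
    ]
    print(f"on-call health: {on_call_health(pages):.1f}/100")
```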

A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:

While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.

Alaina Valenzuela — Honeycomb

Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read though, because it so eloquently explains horizontal versus vertical scaling, why you’d do one or the other, and why horizontal scaling is hard.

Sean T. Allen — Wallaroo Labs

Netflix has some truly massive cache systems at a scale of hundreds of terabytes. Find out what they do to warm up new cache nodes before putting them in production.

Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix
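
The core idea is simple even if Netflix's implementation isn't: copy data from an existing replica into the new node, throttled so the donor keeps serving production traffic, before the new node joins the pool. A minimal sketch with dicts standing in for cache nodes; names and batch sizes are invented.

```python
def warm(donor, target, batch_size=1000):
    """Copy every key from an existing replica into a new, empty node."""
    keys = list(donor.keys())
    for start in range(0, len(keys), batch_size):
        for k in keys[start:start + batch_size]:
            target[k] = donor[k]
        # A real warmer would rate-limit between batches here to avoid
        # impacting the donor node's production reads.
    return len(keys)

if __name__ == "__main__":
    donor_node = {f"user:{i}": {"profile": i} for i in range(5000)}
    new_node = {}
    copied = warm(donor_node, new_node)
    print(f"warmed {copied} keys; new node is ready to join the pool")
```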

This article lays out a promising plan for reducing the number of technologies your engineering department is using while still giving engineers the freedom to choose the right tool for the job.

Charity Majors

Outages

SRE Weekly Issue #150

A message from our sponsor, VictorOps:

The golden signals of SRE are essential metrics to monitor when developing reliable systems. But, the golden signals are just the starting point. See how SRE teams are going past the golden signals to proactively build reliability into their services:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing more mentions of recently.

John Barton

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

SRE Weekly Issue #149

A message from our sponsor, VictorOps:

Runbook automation leads to nearly instant on-call incident response. SRE teams can leverage runbook automation to deepen cross-team collaboration, surface context to on-call responders, and shorten the incident lifecycle, ultimately improving overall service reliability:

http://try.victorops.com/sreweekly/runbook-automation-for-sre

Articles

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi
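
A toy model makes the trade-off easy to see: a synchronous write pays for the slowest replica acknowledgement, while an asynchronous write returns as soon as the primary has the data. The latencies below are invented round-trip times, not measurements.

```python
import random

def replica_rtts(n):
    # Nominal ~5 ms round trips, with an occasional 40 ms hiccup on a replica.
    return [random.gauss(5.0, 1.5) + random.choice([0, 0, 40]) for _ in range(n)]

def write_latency(sync, replicas=3):
    primary = 1.0  # local write cost in ms
    rtts = replica_rtts(replicas)
    if sync:
        return primary + max(rtts)  # wait for the slowest replica ack
    return primary                  # replication happens in the background

if __name__ == "__main__":
    trials = 1000
    sync_avg = sum(write_latency(True) for _ in range(trials)) / trials
    async_avg = sum(write_latency(False) for _ in range(trials)) / trials
    print(f"sync ~{sync_avg:.1f} ms vs async ~{async_avg:.1f} ms per write")
```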

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy
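
At its simplest, this kind of forecasting is a trend line projected forward until it crosses provisioned capacity. Etsy's tooling is far more sophisticated; the sketch below just fits a least-squares line to invented weekly numbers.

```python
def fit_line(ys):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(ys)
    xs = range(n)
    mean_x, mean_y = (n - 1) / 2, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def weeks_until_exhaustion(usage_history, capacity):
    slope, intercept = fit_line(usage_history)
    if slope <= 0:
        return None  # usage is flat or shrinking
    return (capacity - intercept) / slope - (len(usage_history) - 1)

if __name__ == "__main__":
    weekly_disk_tb = [40, 42, 45, 46, 49, 52, 55]  # hypothetical usage
    print(f"~{weeks_until_exhaustion(weekly_disk_tb, 80):.0f} weeks of headroom")
```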

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, and Abeesh Thomas — Grab
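
A common starting point for this kind of experiment is a wrapper that injects latency or errors into a small fraction of dependency calls so you can watch how the caller degrades. The wrapper below is a generic sketch; the probabilities and function names are made up and have nothing to do with Grab's internals.

```python
import random
import time

CHAOS = {"enabled": True, "latency_prob": 0.1, "error_prob": 0.02, "delay_s": 0.3}

def with_chaos(call):
    """Wrap a dependency call with probabilistic latency and failure injection."""
    def wrapped(*args, **kwargs):
        if CHAOS["enabled"]:
            if random.random() < CHAOS["latency_prob"]:
                time.sleep(CHAOS["delay_s"])
            if random.random() < CHAOS["error_prob"]:
                raise TimeoutError("chaos: injected dependency failure")
        return call(*args, **kwargs)
    return wrapped

@with_chaos
def fetch_driver_location(driver_id):
    return {"driver": driver_id, "lat": 1.29, "lng": 103.85}

if __name__ == "__main__":
    for i in range(5):
        try:
            print(fetch_driver_location(i))
        except TimeoutError as e:
            print(f"fallback path: {e}")
```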

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook
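
Facebook trains a real model; a toy version of the same idea just ranks tests by how often they failed when nearby files changed, then runs only the top of the ranking. Everything below (paths, test names, history) is invented for illustration.

```python
from collections import defaultdict

HISTORY = [  # (changed file, test that subsequently failed)
    ("storage/db.py", "test_db"),
    ("storage/db.py", "test_replication"),
    ("web/views.py", "test_views"),
    ("storage/cache.py", "test_db"),
]

def build_model(history):
    scores = defaultdict(lambda: defaultdict(int))
    for path, test in history:
        directory = path.rsplit("/", 1)[0]
        scores[directory][test] += 1
    return scores

def select_tests(model, changed_files, budget=2):
    ranked = defaultdict(int)
    for path in changed_files:
        directory = path.rsplit("/", 1)[0]
        for test, count in model[directory].items():
            ranked[test] += count
    return sorted(ranked, key=ranked.get, reverse=True)[:budget]

if __name__ == "__main__":
    model = build_model(HISTORY)
    print(select_tests(model, ["storage/db.py", "storage/engine.py"]))
```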

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages. I hope you all had an uneventful Friday.

SRE Weekly Issue #148

A message from our sponsor, VictorOps:

In case you don’t know the actual numbers, the costs of downtime can be pretty extreme. SRE doesn’t just improve system reliability and deployment speed; it also creates a lot of business value:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

Last year, PagerDuty shared their incident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)
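
The paper's key move is using request queuing time as the overload signal and shedding load by priority once that signal trips. Here's a heavily simplified sketch of that feedback loop; the thresholds and priorities are invented, and the real DAGOR design is much richer.

```python
QUEUE_TIME_LIMIT_MS = 20
admit_threshold = 0  # 0 admits everything; higher values shed lower priorities

def observe_queue_time(avg_queue_ms):
    """Tighten or relax admission based on how long requests sit in the queue."""
    global admit_threshold
    if avg_queue_ms > QUEUE_TIME_LIMIT_MS:
        admit_threshold = min(admit_threshold + 1, 10)  # overloaded: shed more
    else:
        admit_threshold = max(admit_threshold - 1, 0)   # recovered: admit more

def admit(request_priority):
    return request_priority >= admit_threshold

if __name__ == "__main__":
    for avg_ms, prio in [(5, 0), (35, 0), (40, 0), (40, 5), (10, 0)]:
        observe_queue_time(avg_ms)
        print(f"queue={avg_ms}ms threshold={admit_threshold} "
              f"priority={prio} admitted={admit(prio)}")
```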

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

The section titled “A surprising discovery” is really thought-provoking:

It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

SRE Weekly Issue #147

A message from our sponsor, VictorOps:

Alert fatigue creates confusion, causes undue stress on your team, and hurts the overall reliability of the services you build. See how you can mitigate alert fatigue and build more reliable systems while making people happier:

http://try.victorops.com/sreweekly/effects-of-incident-alert-fatigue

Articles

This is an excellent summary of a talk on testing in production last month.

“Distributed systems are incredibly hostile to being cloned or imitated, or monitored or staged,” she said. “Trying to mirror your staging environment to production is a fool’s errand. Just give up.”

Joab Jackson — The New Stack

A look at the pros and cons of Calvin and Spanner, two data-store papers published in 2012. According to the author, Calvin largely comes out as the favorite.

Daniel Abadi

What a cool concept!

RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy.

Adrian Colyer — The Morning Paper (summary)

Berger et al. (original paper)

With thousands(!) of MySQL shards, Dropbox needed a way to have transactions span multiple shards while maintaining consistency.

Daniel Tahara — Dropbox
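
The textbook way to keep a transaction atomic across shards is two-phase commit: every shard votes to prepare before any of them commits. The sketch below shows that pattern in miniature; it isn't necessarily Dropbox's exact mechanism, and the Shard class is purely illustrative.

```python
class Shard:
    def __init__(self, name):
        self.name, self.data, self.staged = name, {}, None

    def prepare(self, updates):
        self.staged = updates  # a real shard would write-ahead log this
        return True            # vote yes

    def commit(self):
        self.data.update(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None

def cross_shard_txn(shard_updates):
    prepared = []
    for shard, updates in shard_updates:   # phase 1: prepare everywhere
        if shard.prepare(updates):
            prepared.append(shard)
        else:
            for s in prepared:             # any "no" vote aborts the lot
                s.abort()
            return False
    for shard, _ in shard_updates:         # phase 2: commit everywhere
        shard.commit()
    return True

if __name__ == "__main__":
    a, b = Shard("shard_a"), Shard("shard_b")
    ok = cross_shard_txn([(a, {"user:1": "moved"}), (b, {"user:1": "arrived"})])
    print(ok, a.data, b.data)
```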

This is an excellent introduction to heatmaps with some hints on how to interpret a couple common patterns.

Danyel Fisher — Honeycomb
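
If you've never built one, a latency heatmap is just a 2-D histogram: time on one axis, bucketed latency on the other, and each cell showing how many events landed there. This little sketch renders one in ASCII from synthetic traffic, with a slow population appearing halfway through so a second band shows up.

```python
import random

BUCKETS = [10, 25, 50, 100, 250, 1000]  # latency bucket upper bounds, in ms

def bucket(latency):
    return next(i for i, b in enumerate(BUCKETS) if latency <= b)

def render(windows):
    shades = " .:*#"
    for row in range(len(BUCKETS) - 1, -1, -1):
        line = "".join(shades[min(w[row], len(shades) - 1)] for w in windows)
        print(f"{BUCKETS[row]:>5}ms |{line}")

if __name__ == "__main__":
    windows = []
    for minute in range(40):
        counts = [0] * len(BUCKETS)
        for _ in range(30):
            latency = random.gauss(20, 5)
            if minute > 20 and random.random() < 0.3:
                latency += 200  # the slow population forms a second band
            counts[bucket(min(max(latency, 1), 1000))] += 1
        windows.append(counts)
    render(windows)
```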

This is a neat idea. By modelling the relationships between the components in your infrastructure, you can figure out which one might be to blame when everything starts alerting at once. Note: this article is heavily geared toward Instana.

Steve Waterworth — Instana
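
A stripped-down version of the idea: when several components alert at once, walk the dependency graph and suspect the alerting component that the others depend on, directly or transitively. The graph and alert set below are invented; Instana's actual correlation is far richer.

```python
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
}

def reachable(node, graph):
    """Everything the given component depends on, transitively."""
    seen, stack = set(), [node]
    while stack:
        for dep in graph[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def likely_root_cause(alerting, graph):
    # The best suspect is the alerting component with the fewest alerting
    # dependencies left to blame further down the chain.
    return min(alerting, key=lambda n: len(reachable(n, graph) & alerting))

if __name__ == "__main__":
    alerting = {"web", "api", "db"}
    print(likely_root_cause(alerting, DEPENDS_ON))  # -> db
```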

Automated bug fixing seems to be all the rage lately. I wonder, is it practical for companies that aren’t the size of Facebook or Google?

Johannes Bader, Satish Chandra, Eric Lippert, and Andrew Scott — Facebook

Outages
