
SRE Weekly Issue #150

A message from our sponsor, VictorOps:

The golden signals of SRE are essential metrics to monitor when developing reliable systems. But, the golden signals are just the starting point. See how SRE teams are going past the golden signals to proactively build reliability into their services:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing more mentions of recently.

John Barton

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

SRE Weekly Issue #149

A message from our sponsor, VictorOps:

Runbook automation leads to nearly instant on-call incident response. SRE teams can leverage runbook automation to deepen cross-team collaboration, surface context to on-call responders, and shorten the incident lifecycle, ultimately helping overall service reliability:

http://try.victorops.com/sreweekly/runbook-automation-for-sre

Articles

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi
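
To make Abadi's trade-off concrete, here's a back-of-the-envelope latency model; it's my own illustration with invented numbers, not figures from the article. A synchronous commit can only be acknowledged after the slowest replica confirms the write, while an asynchronous commit returns as soon as the local write lands.

LOCAL_WRITE_MS = 1.0
REPLICA_RTT_MS = [12.0, 45.0, 80.0]  # hypothetical round-trip times to three replicas

def async_commit_latency():
    # Replication happens in the background; the client only waits for the local write.
    return LOCAL_WRITE_MS

def sync_commit_latency():
    # The commit is acknowledged only after every replica confirms the write,
    # so the slowest replica sets the floor on commit latency.
    return LOCAL_WRITE_MS + max(REPLICA_RTT_MS)

print(f"async commit: {async_commit_latency():.1f} ms")  # 1.0 ms
print(f"sync commit:  {sync_commit_latency():.1f} ms")   # 81.0 ms, paying for the slowest replica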

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, Abeesh Thomas — Grab

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook
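
If you're curious what that looks like in practice, here's a rough sketch of the general idea, heavily simplified and not Facebook's actual system: score each candidate test with a learned probability of catching a bug in the change, then run tests in descending order until the chance of having missed every failing test is acceptably small. The test names and scores below are hypothetical.

from typing import Dict, List

def select_tests(predicted_fail_prob: Dict[str, float], target_detection: float) -> List[str]:
    # Run tests in descending order of predicted value until the chance of
    # having missed every bug-revealing test drops below 1 - target_detection.
    selected = []
    prob_miss_all = 1.0
    for test, p in sorted(predicted_fail_prob.items(), key=lambda kv: -kv[1]):
        selected.append(test)
        prob_miss_all *= (1.0 - p)  # treats tests as independent -- a big simplification
        if 1.0 - prob_miss_all >= target_detection:
            break
    return selected

# Hypothetical per-test scores that some learned model produced for one code change:
scores = {"test_checkout": 0.40, "test_cart": 0.25, "test_search": 0.05, "test_login": 0.01}
print(select_tests(scores, target_detection=0.5))  # -> ['test_checkout', 'test_cart']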

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages.  I hope you all had an uneventful Friday.

SRE Weekly Issue #148

A message from our sponsor, VictorOps:

In case you don’t know the actual numbers, the costs of downtime can be pretty extreme. The benefits of SRE not only extend to system reliability and deployment speed, but also create a lot of business value:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

Last year, PagerDuty shared their incident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

The section titled “A surprising discovery” is really thought-provoking:

It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI
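
Here's a toy illustration of that discovery, with invented numbers rather than Travis CI's data: a single heavy user of a slow endpoint can own the p99 while leaving everyone else's experience untouched.

import random
random.seed(0)

def p99(samples):
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

everyone_else = [random.uniform(50, 200) for _ in range(9_800)]     # ms, normal traffic
one_bot = [random.uniform(4_000, 6_000) for _ in range(200)]        # ms, ~2% of requests hitting a slow endpoint

print(f"p99 with the bot:    {p99(everyone_else + one_bot):,.0f} ms")  # dominated by the bot
print(f"p99 without the bot: {p99(everyone_else):,.0f} ms")            # what everyone else actually sees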

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

SRE Weekly Issue #147

A message from our sponsor, VictorOps:

Alert fatigue creates confusion, causes undue stress on your team, and hurts the overall reliability of the services you build. See how you can mitigate alert fatigue and build more reliable systems while making people happier:

http://try.victorops.com/sreweekly/effects-of-incident-alert-fatigue

Articles

This is an excellent summary of a talk on testing in production last month.

“Distributed systems are incredibly hostile to being cloned or imitated, or monitored or staged,” she said. “Trying to mirror your staging environment to production is a fool’s errand. Just give up.”

Joab Jackson — The New Stack

The pros and cons of Calvin and Spanner, two data stores whose papers were published in 2012. According to the author, Calvin comes out as the favorite.

Daniel Abadi

What a cool concept!

RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy.

Adrian Colyer — The Morning Paper (summary)

Berger et al. (original paper)

With thousands(!) of MySQL shards, Dropbox needed a way to have transactions span multiple shards while maintaining consistency.

Daniel Tahara — Dropbox
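
For readers who haven't bumped into cross-shard atomicity before, here's a bare-bones two-phase-commit sketch. It's a generic illustration of the problem space, not Dropbox's actual design, which the article describes in detail; the shard names and writes are made up.

class Shard:
    def __init__(self, name):
        self.name = name
        self.staged = None

    def prepare(self, write):
        # Stage the write and vote yes; a real shard would persist this intent
        # (and could vote no, forcing the whole transaction to abort).
        self.staged = write
        return True

    def commit(self):
        print(f"{self.name}: committed {self.staged!r}")

    def abort(self):
        print(f"{self.name}: aborted {self.staged!r}")
        self.staged = None

def cross_shard_transaction(writes):
    # Phase 1: every shard must vote yes before anything becomes visible.
    if all(shard.prepare(write) for shard, write in writes.items()):
        for shard in writes:
            shard.commit()  # Phase 2: apply on every shard
    else:
        for shard in writes:
            shard.abort()

cross_shard_transaction({Shard("shard-1"): "debit account A",
                         Shard("shard-2"): "credit account B"})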

This is an excellent introduction to heatmaps with some hints on how to interpret a couple of common patterns.

Danyel Fisher — Honeycomb

This is a neat idea. By modelling the relationships between the components in your infrastructure, you can figure out which one might be to blame when everything starts alerting at once. Note: this article is heavily geared toward Instana.

Steve Waterworth — Instana
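
As a rough illustration of the concept, and emphatically not Instana's algorithm: given a dependency graph and the set of currently alerting components, the best suspects are the alerting components whose own dependencies are all healthy. The service graph below is hypothetical.

DEPENDS_ON = {  # hypothetical service graph: service -> what it calls
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
}

def likely_root_causes(alerting):
    def reachable(start):
        # All transitive dependencies of a component.
        seen, stack = set(), list(DEPENDS_ON[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(DEPENDS_ON[node])
        return seen

    # A component is a candidate root cause if none of its own dependencies are alerting too.
    return [c for c in alerting if not (reachable(c) & set(alerting))]

print(likely_root_causes(["web", "api", "db"]))  # -> ['db']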

Automated bug fixing seems to be all the rage lately. I wonder, is it practical for companies that aren’t the size of Facebook or Google?

Johannes Bader, Satish Chandra, Eric Lippert, and Andrew Scott — Facebook

Outages

SRE Weekly Issue #146

A message from our sponsor, VictorOps:

Automation can be used to help classify incident severity and route alerts to the right person or team. Learn how SRE teams are leveraging a refined incident classification and alert routing process to improve system reliability:

http://try.victorops.com/sreweekly/classifying-incident-severity

Articles

NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.

Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.

David Mytton — StackPath

GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.

Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.

Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.

Matthew Barlocker — Blue Matador

This sounds an awful lot like those packet rate limits from the previous article…

Chris McFadden — SparkPost

Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworth’s tried a similar experiment with their stores, with interesting results.

Sidney Dekker — Safety Differently

Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it fails to obey timeouts.

Ciaran Gaffney and Fran Garcia — Hosted Graphite
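
As a reminder of what setting a deadline looks like from the client side, here's a hedged grpcio sketch, not Hosted Graphite's code. PingServiceStub and PingRequest stand in for whatever your generated proto classes are called, so treat them as hypothetical; without an explicit timeout, a grpcio call can wait indefinitely.

import grpc
# from ping_pb2 import PingRequest            # hypothetical generated message class
# from ping_pb2_grpc import PingServiceStub   # hypothetical generated stub class

def ping_with_deadline(stub, request):
    try:
        # timeout= puts an upper bound (in seconds) on how long this call may take.
        return stub.Ping(request, timeout=2.0)
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            # Fail fast and let the caller retry or degrade gracefully.
            return None
        raise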

when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…

Russ Miles — ChaosIQ

Outages
