General

SRE Weekly Issue #148

A message from our sponsor, VictorOps:

In case you don’t know the actual numbers, the costs of downtime can be pretty extreme. The benefits of SRE extend beyond system reliability and deployment speed; SRE also creates a lot of business value:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

Last year, PagerDuty shared their incident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

There are two reasons to love this paper. First, we get some insights into the backend that powers WeChat; second, the authors share the design of DAGOR, the battle-hardened overload control system that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)
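To make the queueing-delay-based admission control idea concrete, here’s a minimal sketch of my own (not DAGOR’s actual algorithm; the class, thresholds, and numbers are invented for illustration):

    # Sketch of admission control driven by observed queueing delay. When the
    # average delay rises above a target, progressively shed lower-priority
    # requests; when it recovers, let traffic back in.
    from collections import deque

    class LoadShedder:
        def __init__(self, target_queue_ms=20, max_priority=10):
            self.target_queue_ms = target_queue_ms  # queueing delay we try to stay under
            self.max_priority = max_priority
            self.admit_threshold = 0                # 0 = admit everything
            self.recent_delays = deque(maxlen=100)  # sliding window of observed delays

        def record_queue_delay(self, delay_ms):
            self.recent_delays.append(delay_ms)

        def adjust(self):
            """Called periodically to tighten or relax the admission threshold."""
            if not self.recent_delays:
                return
            avg = sum(self.recent_delays) / len(self.recent_delays)
            if avg > self.target_queue_ms and self.admit_threshold < self.max_priority:
                self.admit_threshold += 1           # overloaded: shed lower-priority work
            elif avg < self.target_queue_ms / 2 and self.admit_threshold > 0:
                self.admit_threshold -= 1           # recovered: admit more traffic again

        def admit(self, priority):
            """Higher numbers mean more important requests."""
            return priority >= self.admit_threshold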

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

The section titled “A surprising discovery” is really thought-provoking:

It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI
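To see why, here’s a toy calculation (numbers made up) showing how one bot hammering a slow endpoint can dominate p99 while every other user’s latency stays flat:

    # One bot accounts for 2% of requests, all to a ~2 s endpoint; everyone
    # else sees ~50 ms. The overall p99 lands squarely in the bot's traffic.
    import statistics

    normal_traffic = [50] * 9800   # 9,800 requests at ~50 ms from regular users
    bot_traffic = [2000] * 200     # 200 requests at ~2,000 ms from one bot

    all_requests = normal_traffic + bot_traffic
    p99_all = statistics.quantiles(all_requests, n=100)[98]
    p99_others = statistics.quantiles(normal_traffic, n=100)[98]

    print(p99_all)     # ~2000 ms: the bot owns the tail
    print(p99_others)  # 50 ms: nobody else noticed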

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

SRE Weekly Issue #147

A message from our sponsor, VictorOps:

Alert fatigue creates confusion, causes undue stress on your team, and hurts the overall reliability of the services you build. See how you can mitigate alert fatigue and build more reliable systems while making people happier:

http://try.victorops.com/sreweekly/effects-of-incident-alert-fatigue

Articles

This is an excellent summary of a talk on testing in production last month.

“Distributed systems are incredibly hostile to being cloned or imitated, or monitored or staged,” she said. “Trying to mirror your staging environment to production is a fool’s errand. Just give up.”

Joab Jackson — The New Stack

The pros and cons of Calvin and Spanner, two data-store papers published in 2012. The author comes down largely in favor of Calvin.

Daniel Abadi

What a cool concept!

RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy.

Adrian Colyer — The Morning Paper (summary)

Berger et al. (original paper)

With thousands(!) of MySQL shards, Dropbox needed a way to have transactions span multiple shards while maintaining consistency.

Daniel Tahara — Dropbox
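The classic shape of the solution is some variant of two-phase commit. Here’s a bare-bones coordinator sketch (this is not Dropbox’s implementation; the shard objects and method names are hypothetical):

    # Two-phase commit across shards, reduced to its skeleton. A real
    # coordinator durably logs its commit decision before phase 2 so the
    # outcome survives a coordinator crash.
    def cross_shard_commit(shards, txn):
        prepared = []
        try:
            # Phase 1: every shard validates, locks, and durably records intent.
            for shard in shards:
                shard.prepare(txn)
                prepared.append(shard)
        except Exception:
            # Any prepare failure means nobody commits.
            for shard in prepared:
                shard.abort(txn)
            return False

        # Phase 2: all shards prepared, so the transaction must commit everywhere.
        for shard in shards:
            shard.commit(txn)
        return True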

This is an excellent introduction to heatmaps with some hints on how to interpret a couple common patterns.

Danyel Fisher — Honeycomb
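If you’ve never built one: a latency heatmap is essentially a 2-D histogram of (time, latency) pairs. A rough matplotlib sketch with synthetic data:

    # Render a latency heatmap from raw (timestamp, latency) pairs. The data
    # here is synthetic, just to show the mechanics.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    timestamps = rng.uniform(0, 3600, 50_000)                    # one hour of requests
    latencies = rng.lognormal(mean=4.0, sigma=0.6, size=50_000)  # ms, long-tailed

    plt.hist2d(timestamps, latencies, bins=[60, 50], cmap="viridis")
    plt.xlabel("time (s)")
    plt.ylabel("latency (ms)")
    plt.colorbar(label="request count")
    plt.show()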

This is a neat idea. By modelling the relationships between the components in your infrastructure, you can figure out which one might be to blame when everything starts alerting at once. Note: this article is heavily geared toward Instana.

Steve Waterworth — Instana
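A toy version of the inference (not Instana’s algorithm, just the general shape): among everything currently alerting, the likeliest culprit is a component none of whose own dependencies are alerting.

    # Given a dependency graph and the set of alerting components, flag the
    # alerting components whose dependencies are all healthy.
    def likely_root_causes(dependencies, alerting):
        """dependencies maps each component to the set of components it depends on."""
        return [c for c in alerting if not (dependencies.get(c, set()) & alerting)]

    deps = {
        "web": {"api"},
        "api": {"db", "cache"},
        "db": set(),
        "cache": set(),
    }
    print(likely_root_causes(deps, alerting={"web", "api", "db"}))  # -> ['db']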

Automated bug fixing seems to be all the rage lately. I wonder, is it practical for companies that aren’t the size of Facebook or Google?

Johannes Bader, Satish Chandra, Eric Lippert, and Andrew Scott — Facebook

Outages

SRE Weekly Issue #146

A message from our sponsor, VictorOps:

Automation can be used to help classify incident severity and route alerts to the right person or team. Learn how SRE teams are leveraging a refined incident classification and alert routing process to improve system reliability:

http://try.victorops.com/sreweekly/classifying-incident-severity

Articles

NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.

Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.

David Mytton — StackPath

GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.

Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.

Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.

Matthew Barlocker — Blue Matador

This sounds an awful lot like those packet rate limits from the previous article…

Chris McFadden — SparkPost

Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworths tried a similar experiment with their stores, with interesting results.

Sidney Dekker — Safety Differently

Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it fails to obey timeouts.

Ciaran Gaffney and Fran Garcia — Hosted Graphite
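If you set gRPC deadlines in Python, the usual pattern looks like the sketch below (the stub and method are hypothetical, generated from a .proto); the article is a good reminder not to assume the library will always enforce them for you:

    # Always pass an explicit per-call timeout and handle DEADLINE_EXCEEDED
    # yourself. MetricsServiceStub / StoreMetric are hypothetical names.
    import grpc

    channel = grpc.insecure_channel("metrics-backend:50051")  # hypothetical endpoint
    stub = MetricsServiceStub(channel)                        # hypothetical generated stub

    def store_metric(request, deadline_s=1.0):
        try:
            return stub.StoreMetric(request, timeout=deadline_s)
        except grpc.RpcError as err:
            if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
                return None  # blown deadline: let the caller retry or drop the write
            raise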

when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…

Russ Miles — ChaosIQ

Outages

SRE Weekly Issue #145

A message from our sponsor, VictorOps:

When SRE teams track incident management KPIs and benchmarks, they can better optimize the way they operate, helping SREs create more resilient teams and build more reliable systems:

http://try.victorops.com/sreweekly/top-incident-management-kpis

Articles

An article on looking past human error in investigating air sports (definition) accidents, drawing on the writing of Don Norman. Special emphasis on slips versus mistakes:

“Slips tend to occur more frequently to skilled people than to novices
[…]

Mara Schmid — Blue Skies Magazine

A VP of NS1 explains how his company rewrote and deployed their core service without downtime.

Shannon Weyrick — NS1

This guide from Hosted Graphite has a ton of great advice and reads almost as if they’ve released their internal incident response guidelines. Bonus content: check out this exemplary post-incident followup from their status site.

Fran Garcia — Hosted Graphite

Check it out, Atlassian posted their incident management documentation publicly!

On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.

I promised to do a blog post with the takeaways, so here they are.

Charity Majors

[…] at a certain point, it’s too expensive to keep fixing bugs because of the high opportunity cost of building new features. You need to decide your target for stability just like you would availability, and it should not be 100%.

Kristine Pinedo — Bugsnag
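The availability half of that analogy is easy to make concrete:

    # A 99.9% availability target over a 30-day month leaves an error budget
    # of about 43 minutes; the quote argues for budgeting stability the same way.
    minutes_per_month = 30 * 24 * 60
    target = 0.999
    print(minutes_per_month * (1 - target))  # ~43.2 minutes of downtime allowed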

Maelstrom is Facebook’s tool to assist engineers in safely moving traffic off of impaired infrastructure.

Adrian Colyer — The Morning Paper (summary)
Veeraraghavan et al. — Facebook (original paper)

Attempting to stamp out failure entirely can have the paradoxical effect of reducing resiliency to anomalous situations. Instead, we need to handle failure constructively.

Daniel Hummerdal — Safety Differently


Outages

SRE Weekly Issue #144

A message from our sponsor, VictorOps:

Customers expect reliability, even in today’s era of CI/CD and Agile software development. That’s why SRE is more important than ever. Learn about the importance of getting buy-in from your entire team when taking on SRE:

http://try.victorops.com/sreweekly/organizational-sre-support

Articles

GitLab is incredibly open with their policies, and incident management is no exception.

GitLab

Ooh, new newsletter! This one focuses specifically on resiliency. It seems to have just a few articles each week with in-depth summaries.

Thai Wood

This article starts with a fictitious(?) account of the kind of failure that can occur when teams step on each other’s toes in a serverless environment. It goes on to discuss techniques for dealing with this class of problems, including careful permission management.

Toby Fee — jaxenter

Sometimes fixing a rarely-occurring bug can be especially difficult. Recording enough information all the time to debug those rare failures would be too resource-intensive. Check out this fascinating technique for working backward from a memory dump to infer the prior contents of memory in the time leading up to a failure.

Adrian Colyer — The Morning Paper (summary)
Cui et al. (original paper)

An introduction to the concept of reactive systems, including a description of their high-level architectural features.

Sinkevich Uladzimir — The Server Side

Initially, you can improve reliability by studying incidents to find out what went wrong. This article explains why that strategy will only get you so far.

Thanks to Thomas Depierre for this one.
Sidney Dekker — Safety Differently

Chaos Monkey Guide for Engineers – Tips, Tutorials, and Training

Gremlin released this huge guide on Chaos Monkey, covering theory, practice, further reading, and lots of other resources.

Gremlin, inc.

Outages

A production of Tinker Tinker Tinker, LLC