SRE Weekly Issue #152

A message from our sponsor, VictorOps:

SRE teams can leverage automation in chat to improve incident response and make on-call suck less. Learn the ins and outs of using automated ChatOps for incident response:

http://try.victorops.com/sreweekly/automated-chatops-in-incident-response

Articles

It’s hard to summarize all the awesome here, but it boils down to empathy, collaboration, and asking, “How can I help?”. These pay dividends all over an organization, especially in reliability.

Note: Will Gallego is my coworker, although I came across this post on my own.

Will Gallego

This followup post for a Google Groups outage was (fittingly) hidden away in a Google Group.

Thanks to Jonathan Rudenberg for this one.

Now I can link directly to specific incidents! I miss the graphs, though.

Jamie Hannaford — GitHub

I laughed so hard I scared my cats:

COWORKER: we need to find the root cause asap
ME: *takes long drag* the root cause is that our processes are not robust enough to prevent a person from making this mistake
COWORKER: amy please not right now

Great discussion in the thread!

Amy Nguyen

In Air Traffic Control parlance, if a pilot or controller can’t satisfy a request, they should state that they are “unable” to comply. It can be difficult to decide in the moment what one is truly “unable” to do. There are a lot of great lessons here that apply equally well to IT incident response.

Tarrance Kramer — AVweb

The common theme at KubeCon is that SRE teams at many companies produce reliable, reusable patterns for their developers to build with.

Beth Pariseau — TechTarget

This is the story of a tenacious fight to find out what went wrong during an incident. If you read nothing else, the Conclusion section has a lot of great tidbits.

Tony Meehan — Endgame

Here’s a new guide on how to apply Restorative Just Culture. This made me laugh:

They also fail to address the systemic issues that gave rise to the harms caused, since they reduce an incident to an individual who needs to be ‘just cultured’.

Sidney Dekker — Safety Differently

Outages

SRE Weekly Issue #151

A message from our sponsor, VictorOps:

SRE teams can use synthetic monitoring and real-user monitoring to create a holistic understanding of the way their system handles stress. See how SRE teams are already implementing synthetic and real-user monitoring tools:

http://try.victorops.com/sreweekly/synthetic-and-real-user-monitoring-for-sre

Articles

They used feature flags to safely transition from a single-host service to a horizontally-scaled distributed system.

Ciaran Egan and Cian Synnott — Hosted Graphite
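
The post describes the approach rather than the code, but a percentage-based rollout flag for this kind of migration might look roughly like the sketch below (all names are illustrative, not Hosted Graphite’s):

```python
import hashlib

# Illustrative sketch only -- not Hosted Graphite's actual code.
# A percentage-based flag deterministically routes a stable slice of
# accounts to the new horizontally-scaled backend; ramp the knob up
# (or back down) without redeploying routing logic.
ROLLOUT_PERCENT = 5  # 0 = all traffic on the legacy host, 100 = fully migrated

def use_new_backend(account_id: str) -> bool:
    """Hash the account ID into a 0-99 bucket and compare to the rollout knob."""
    bucket = int(hashlib.sha256(account_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

def write_datapoint(account_id: str, datapoint, new_cluster, legacy_host):
    # new_cluster and legacy_host are hypothetical client objects.
    backend = new_cluster if use_new_backend(account_id) else legacy_host
    backend.write(account_id, datapoint)
```

Deterministic bucketing means an account stays on whichever backend it was assigned to as the percentage ramps, which keeps comparisons between the old and new paths clean.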

Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.

Bakha Nurzhanov — RealSelf

The challenge: you have to defend against abuse to keep your service running, but the abuse detection also must not adversely impact the user experience.

Sahil Handa — LinkedIn

PagerDuty has developed a system for measuring on-call health, factoring in quantity of pages, time of each page, frequency, clustering of pages, etc. I love what they’re doing and I hope we see more of this in our industry.

Lisa Yang — PagerDuty
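
PagerDuty doesn’t publish an exact formula, but the factors they list lend themselves to a simple weighted score. Here’s a toy sketch with invented weights, just to make the idea concrete:

```python
from datetime import datetime, timedelta

# Toy on-call health score. The factors mirror the ones PagerDuty mentions
# (volume, time of day, clustering); the weights and thresholds are invented.
def on_call_health(page_times: list[datetime]) -> float:
    if not page_times:
        return 100.0
    pages = sorted(page_times)
    volume_penalty = 2.0 * len(pages)
    # Pages outside 09:00-18:00 cost more: they eat into sleep and personal time.
    off_hours_penalty = 5.0 * sum(1 for t in pages if not 9 <= t.hour < 18)
    # Pages within an hour of the previous one suggest a shift with no recovery time.
    clustered = sum(1 for a, b in zip(pages, pages[1:]) if b - a < timedelta(hours=1))
    return max(0.0, 100.0 - volume_penalty - off_hours_penalty - 3.0 * clustered)
```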

A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:

While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.

Alaina Valenzuela — Honeycomb

Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read though, because it so eloquently explains horizontal versus vertical scaling, why you’d do one or the other, and why horizontal scaling is hard.

Sean T. Allen — Wallaroo Labs

Netflix has some truly massive cache systems at a scale of hundreds of terabytes. Find out what they do to warm up new cache nodes before putting them in production.

Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix
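
The post covers Netflix’s actual tooling; the core motion, copying data from a node that’s already serving traffic onto a fresh node before it joins the pool, reduces to something like this sketch (the get_many/set_many client interface is an assumption):

```python
# Minimal cache-warming sketch, not Netflix's EVCache tooling.
# source is a node already serving traffic; target is the new node being
# warmed before it joins the pool. Both are assumed to expose
# get_many(keys) -> dict and set_many(dict).
def warm_new_node(source, target, keys, batch_size=1000):
    batch = []
    for key in keys:
        batch.append(key)
        if len(batch) >= batch_size:
            target.set_many(source.get_many(batch))
            batch.clear()
    if batch:
        target.set_many(source.get_many(batch))  # flush the final partial batch
```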

This article lays out a promising plan for reducing the number of technologies your engineering department is using while still giving engineers the freedom to choose the right tool for the job.

Charity Majors

Outages

SRE Weekly Issue #150

A message from our sponsor, VictorOps:

The golden signals of SRE are essential metrics to monitor when developing reliable systems. But, the golden signals are just the starting point. See how SRE teams are going past the golden signals to proactively build reliability into their services:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing more mentions of recently.

John Barton

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

SRE Weekly Issue #149

A message from our sponsor, VictorOps:

Runbook automation leads to nearly instant on-call incident response. SRE teams can leverage runbook automation to deepen cross-team collaboration, surface context to on-call responders, and shorten the incident lifecycle, ultimately helping overall service reliability:

http://try.victorops.com/sreweekly/runbook-automation-for-sre

Articles

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi
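
The latency argument is easy to see in miniature: a synchronously replicated write isn’t acknowledged until the slowest required replica confirms, so every write pays that round trip. Back-of-envelope, with invented numbers:

```python
# Invented numbers, purely to illustrate the trade-off the article discusses.
local_commit_ms = 1.0           # commit on the primary
replica_rtt_ms = [12.0, 45.0]   # round trips to two replicas in other regions

# Asynchronous replication: acknowledge after the local commit,
# ship the write to replicas afterwards (risking loss on failover).
async_ack = local_commit_ms

# Synchronous replication: acknowledge only once every required replica
# confirms, so the slowest replica sets the floor on write latency.
sync_ack = local_commit_ms + max(replica_rtt_ms)

print(f"async ack: {async_ack:.1f} ms, sync ack: {sync_ack:.1f} ms")
```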

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy
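
The write-up focuses on Etsy’s process and tooling; stripped to its simplest form, the forecasting step can be as small as fitting a trend to historical usage and projecting when it crosses capacity (a toy sketch, not Etsy’s implementation):

```python
import numpy as np

# Toy forecast: fit a linear trend to weekly disk usage (numbers invented)
# and estimate how many weeks of headroom remain before hitting capacity.
weeks = np.arange(12)
used_tb = np.array([4.1, 4.3, 4.4, 4.7, 4.9, 5.0, 5.3, 5.5, 5.6, 5.9, 6.1, 6.2])
capacity_tb = 10.0

slope, intercept = np.polyfit(weeks, used_tb, 1)
weeks_until_full = (capacity_tb - intercept) / slope - weeks[-1]
print(f"growth ~{slope:.2f} TB/week, roughly {weeks_until_full:.0f} weeks of headroom")
```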

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, Abeesh Thomas — Grab

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook
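
Facebook’s system is considerably more sophisticated, but the shape of the idea is a classifier over change features that predicts which tests are likely to fail. A purely illustrative sketch with scikit-learn and made-up features:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Purely illustrative, not Facebook's model. Each row describes one
# (change, test) pair: files touched, lines changed, the test's historical
# failure rate, and how often it failed together with these files before.
X_train = [
    [3, 120, 0.02, 1],
    [1,  10, 0.00, 0],
    [8, 640, 0.15, 5],
    [2,  45, 0.01, 0],
]
y_train = [1, 0, 1, 0]  # 1 = the test failed on that change

model = GradientBoostingClassifier().fit(X_train, y_train)

def should_run(features, threshold=0.1):
    """Run the test only if its predicted failure probability clears the threshold."""
    return model.predict_proba([features])[0][1] >= threshold

print(should_run([5, 300, 0.08, 2]))
```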

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages.  I hope you all had an uneventful Friday.

SRE Weekly Issue #148

A message from our sponsor, VictorOps:

In case you don’t know the actual numbers, the costs of downtime can be pretty extreme. The benefits of SRE not only extend to system reliability and deployment speed, but also create a lot of business value:

http://try.victorops.com/sreweekly/costs-of-downtime

Articles

Last year, PagerDuty shared their incident response documentation. Now they’ve posted their training materials as well!

PagerDuty

Dropbox’s write-heavy read-light usage pattern makes this architecture overview worth a read.

Diwaker Gupta — Dropbox

There are two reasons to love this paper. First off, we get some insights into the backend that powers WeChat; and secondly the authors share the design of the battle hardened overload control system DAGOR that has been in production at WeChat for five years.

Adrian Colyer — The Morning Paper (review and summary)

Zhou et al. (original paper)
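
The paper is worth reading in full; the heart of the approach, admission control that sheds lower-priority requests as queueing delay climbs, can be caricatured in a few lines (thresholds invented, not the paper’s actual algorithm):

```python
# Caricature of priority-based load shedding in the spirit of DAGOR --
# not the paper's algorithm. Requests carry a business priority
# (lower = more important); as measured queueing delay grows past the
# target, the admission threshold tightens and low-priority work is shed.
TARGET_QUEUE_DELAY_S = 0.02  # invented target

def admission_threshold(queue_delay_s: float, max_priority: int = 10) -> int:
    """Return the highest (least important) priority still admitted right now."""
    if queue_delay_s <= TARGET_QUEUE_DELAY_S:
        return max_priority                        # healthy: admit everything
    overload_factor = queue_delay_s / TARGET_QUEUE_DELAY_S
    return max(1, int(max_priority / overload_factor))  # shed from the bottom up

def admit(request_priority: int, queue_delay_s: float) -> bool:
    return request_priority <= admission_threshold(queue_delay_s)

print(admit(8, 0.01))  # healthy system: admitted
print(admit(8, 0.08))  # overloaded: low-priority request shed
print(admit(2, 0.08))  # overloaded: important request still admitted
```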

A tale of a nearly business-ending security incident and outage. Transparency and solid incident management helped them survive the event and prosper.

Paul Biggar

The section titled “A surprising discovery” is really thought-provoking:

It turns out that a single (bot) user was sending us a lot of traffic to a particularly slow endpoint. So while this was impacting the p99 latency, it was in fact not impacting any other users.

Igor Wiedler — Travis CI
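
The arithmetic behind that discovery is worth internalizing: a small volume of very slow traffic from one client can own the entire top percentile without touching anyone else’s experience. A quick illustration with made-up numbers:

```python
import numpy as np

# Made-up numbers illustrating the observation: one bot hammering a slow
# endpoint makes up ~2% of requests, so it completely owns the top 1%.
rng = np.random.default_rng(0)
users_ms = rng.normal(120, 30, size=9_800).clip(min=1)   # regular user requests
bot_ms = rng.normal(3_000, 500, size=200)                # one bot, slow endpoint

print(f"p99 without bot: {np.percentile(users_ms, 99):.0f} ms")
print(f"p99 with bot:    {np.percentile(np.concatenate([users_ms, bot_ms]), 99):.0f} ms")
```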

An (external) analysis of the GitHub outage, with a discussion of how Orchestrator reacts to a network partition.

Ayende Rahien

I’m working on a talk for kubecon in December! One of the points I want to get across is the amount of time/investment it takes to use new software in production without causing really serious incidents, and what that’s looked like for us in our use of Kubernetes.

Julia Evans

Outages

A production of Tinker Tinker Tinker, LLC