General

SRE Weekly Issue #153

A message from our sponsor, VictorOps:

SRE teams can leverage chaos engineering, stress testing, and load testing tools to proactively build reliability into the services they build. This list of open source chaos tools can help you get started:

http://try.victorops.com/sreweekly/open-source-chaos-testing-tools

Articles

In this podcast episode, Courtney Eckhardt and the panel cover a lot of bases related to incident response, retrospectives, defensiveness, blamelessness, social justice, and tons more engrossing stuff. Well worth a listen.

Mandy Moore (summary); John K. Sawers, Sam Livingston-Gray, Jamey Hampton, and Coraline Ada Ehmke (panelists); Courtney Eckhardt (guest)

Do you wonder what effect partitioned versus unified consistency might have on latency? Do you want to know what those terms mean? Read on.

Daniel Abadi

Cape is Dropbox’s real-time event processing system. The design sections of this article have a ton of interesting detail, and I also love the part where they explain their motivation for not simply using an existing queuing system.

Peng Kang — Dropbox

This is a great intro to the circuit breaker pattern if you’re unfamiliar with it, and it’s also got a lot of meaty content for folks already experienced with circuit breakers.

Corey Scott — Grab
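
If you’d like a feel for the mechanics before reading, here’s a minimal sketch of my own (a hypothetical illustration, not Grab’s implementation): trip open after a run of consecutive failures, fail fast while the circuit is open, and let a probe call through after a cooldown.

import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: trip after N consecutive failures,
    reject calls while open, and probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result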

Though it sounds counterintuitive, more dashboards often make people less informed and less aligned.

Having a few good dashboards is important, but having too many gets in the way of your ability to do dynamic analysis.

Benn Stancil — Mode

What activities count as SRE work, versus “just” Operations?

Site Reliability Engineers do Operations but are not an Operations Team.

Stephen Thorne

Outages

SRE Weekly Issue #152

A message from our sponsor, VictorOps:

SRE teams can leverage automation in chat to improve incident response and make on-call suck less. Learn the ins and outs of using automated ChatOps for incident response:

http://try.victorops.com/sreweekly/automated-chatops-in-incident-response

Articles

It’s hard to summarize all the awesome here, but it boils down to empathy, collaboration, and asking, “How can I help?”. These pay dividends all over an organization, especially in reliability.

Note: Will Gallego is my coworker, although I came across this post on my own.

Will Gallego

This followup post for a Google Groups outage was (fittingly) hidden away in a Google Group.

Thanks to Jonathan Rudenberg for this one.

Now I can link directly to specific incidents! I miss the graphs, though.

Jamie Hannaford — GitHub

I laughed so hard I scared my cats:

COWORKER: we need to find the root cause asap
ME: *takes long drag* the root cause is that our processes are not robust enough to prevent a person from making this mistake
COWORKER: amy please not right now

Great discussion in the thread!

Amy Nguyen

In Air Traffic Control parlance, if a pilot or controller can’t satisfy a request, they should state that they are “unable” to comply. It can be difficult to decide in the moment what one is truly “unable” to do. There are a lot of great lessons here that apply equally well to IT incident response.

Tarrance Kramer — AVweb

The common theme at KubeCon is that SRE teams at many companies produce reliable, reusable patterns for their developers to build with.

Beth Pariseau — TechTarget

This is the story of a tenacious fight to find out what went wrong during an incident. If you read nothing else, the Conclusion section has a lot of great tidbits.

Tony Meehan — Endgame

Here’s a new guide on how to apply Restorative Just Culture. This made me laugh:

They also fail to address the systemic issues that gave rise to the harms caused, since they reduce an incident to an individual who needs to be ‘just cultured’.

Sidney Dekker — Safety Differently

Outages

SRE Weekly Issue #151

A message from our sponsor, VictorOps:

SRE teams can use synthetic monitoring and real-user monitoring to create a holistic understanding of the way their system handles stress. See how SRE teams are already implementing synthetic and real-user monitoring tools:

http://try.victorops.com/sreweekly/synthetic-and-real-user-monitoring-for-sre

Articles

They used feature flags to safely transition from a single-host service to a horizontally-scaled distributed system.

Ciaran Egan and Cian Synnott — Hosted Graphite
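
For a flavor of the technique (a hypothetical sketch, not Hosted Graphite’s actual code), a percentage-based flag can deterministically route a small slice of traffic to the new backend while the old path stays the default:

import hashlib

def flag_enabled(flag_name: str, key: str, rollout_percent: int) -> bool:
    # Deterministic bucketing: the same key always lands in the same
    # bucket, so a given host or user stays on one code path between deploys.
    digest = hashlib.sha256(f"{flag_name}:{key}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

def route_write(source_host: str, payload: bytes) -> str:
    # Hypothetical handlers standing in for the legacy single-host path
    # and the new horizontally-scaled one.
    if flag_enabled("distributed-ingest", source_host, rollout_percent=5):
        return f"sent {len(payload)} bytes to the new distributed backend"
    return f"sent {len(payload)} bytes to the legacy single-host backend"

print(route_write("agent-042", b"my.metric 1 1546300800"))

The nice property of deterministic bucketing is that ramping the percentage up or down never flips a given source back and forth between code paths.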

Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.

Bakha Nurzhanov — RealSelf

The challenge: you have to defend against abuse to keep your service running, but the abuse detection also must not adversely impact the user experience.

Sahil Handa — LinkedIn

PagerDuty has developed a system for measuring on-call health, factoring in quantity of pages, time of each page, frequency, clustering of pages, etc. I love what they’re doing and I hope we see more of this in our industry.

Lisa Yang — PagerDuty
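
The post doesn’t publish a formula, but as a purely hypothetical illustration of how those factors might combine (weights, thresholds, and names invented here, not PagerDuty’s), a health score could look something like this:

from datetime import datetime, timedelta

def on_call_health_score(pages: list[datetime]) -> float:
    # Hypothetical sketch: penalize page volume, pages during sleep hours,
    # and tight clusters of pages; higher scores mean a healthier rotation.
    if not pages:
        return 100.0
    pages = sorted(pages)
    volume_penalty = 2.0 * len(pages)
    sleep_penalty = 5.0 * sum(1 for p in pages if p.hour < 7 or p.hour >= 22)
    cluster_penalty = 3.0 * sum(
        1 for earlier, later in zip(pages, pages[1:])
        if later - earlier < timedelta(minutes=30)
    )
    return max(0.0, 100.0 - volume_penalty - sleep_penalty - cluster_penalty)

print(on_call_health_score([datetime(2018, 12, 1, 3, 10),
                            datetime(2018, 12, 1, 3, 25)]))  # 83.0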

A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:

While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.

Alaina Valenzuela — Honeycomb

Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read though, because it so eloquently explains horizontal versus vertical scaling, why you’d do one or the other, and why horizontal scaling is hard.

Sean T. Allen — Wallaroo Labs

Netflix has some truly massive cache systems at a scale of hundreds of terabytes. Find out what they do to warm up new cache nodes before putting them in production.

Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix

This article lays out a promising plan for reducing the number of technologies your engineering department is using while still giving engineers the freedom to choose the right tool for the job.

Charity Majors

Outages

SRE Weekly Issue #150

A message from our sponsor, VictorOps:

The golden signals of SRE are essential metrics to monitor when developing reliable systems. But the golden signals are just the starting point. See how SRE teams are going past the golden signals to proactively build reliability into their services:

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

Detailed notes on lots of talks from SRECon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing mentioned more often lately.

John Barton

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University

Outages

SRE Weekly Issue #149

A message from our sponsor, VictorOps:

Runbook automation leads to nearly instant on-call incident response. SRE teams can leverage runbook automation to deepen cross-team collaboration, surface context to on-call responders, and shorten the incident lifecycle, ultimately helping overall service reliability:

http://try.victorops.com/sreweekly/runbook-automation-for-sre

Articles

But does that mean you don’t need to think about reliability issues associated with large-scale distributed systems? The answer is, not completely. While there are many things that GCP and Cloud Functions handle behind the scenes, you still need to keep a couple of best practices in mind while building a reliable serverless solution.

Slawomir Walkowski — Google

The Emotet malware gang is probably managing their server infrastructure better than most companies are running their internal or external IT systems.

Catalin Cimpanu — Zero Day

Designing a distributed data store is about juggling competing priorities. This author discusses the latency penalty you pay for synchronous replication, and why you might want it anyway.

Daniel Abadi
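
As a toy model of that latency penalty (mine, not the author’s): a synchronous write isn’t acknowledged until enough replicas confirm it, so the slowest replica in the quorum sets the floor.

def sync_write_latency_ms(local_ms: float, replica_rtts_ms: list[float],
                          quorum: int) -> float:
    # The write is only acknowledged once `quorum` replicas have confirmed,
    # so latency is gated by the quorum-th fastest replica round trip.
    acked = sorted(replica_rtts_ms)[:quorum]
    return local_ms + max(acked)

# Replicas at 2 ms, 40 ms, and 90 ms round trip; requiring two acks means
# the 40 ms replica sets the floor, whereas an asynchronous write would
# return after just the local 1 ms.
print(sync_write_latency_ms(1.0, [2.0, 40.0, 90.0], quorum=2))  # 41.0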

Learn how Etsy designed tooling and a repeatable process to forecast resource usage.

Daniel Schauenberg — Etsy

Check out how Grab implemented chaos engineering.

Roman Atachiants, Tharaka Wijebandara, and Abeesh Thomas — Grab

Neat idea: use machine learning to select which automated tests to run for a given code change. The goal is a high likelihood of finding bugs while running fewer tests than traditional methods.

Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra — Facebook

In this blog post, we are going to discuss how the Auth0 Site Reliability team, led by Hernán Meydac Jean, used a progressive approach to build a mature service architecture characterized by high availability and reliability.

The system in question is a home-grown feature flags implementation.

Dan Arias — Auth0

Outages

The usual glut of Black Friday outages. I hope you all had an uneventful Friday.

A production of Tinker Tinker Tinker, LLC