SRE Weekly Issue #247

A message from our sponsor, StackHawk:

The ZAP open source project is the underlying security scanner for StackHawk. Check out this 21-minute introduction to ZAP from project founder and core contributor Simon Bennetts.
https://sthwk.com/zap-intro-video

Articles

This incident report from a September Datadog outage has an interesting tidbit about scaling external incident response in tandem with internal response.

Alexis Lê-Quôc — Datadog

This is Google’s write-up for an interesting issue that involved repeated re-sending of invitations to edit a Google Drive document.

Google

I basically want to immediately absorb any article with this title, unless it’s just clickbait spam. This one definitely isn’t.

Ronak Nathani

Lots of juicy details in this one about the difficulty Slack has had in scaling their DB layer and how Vitess solved their problems.

Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón — Slack

Hitting file descriptor limits is such an annoying kind of outage. Some good tips here, clearly coming from hard-won experience.

Utsav Shah
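
If you've never had to debug one of these, a quick way to see where a process stands (a generic illustration, not from the article) is Python's stdlib resource module:

```python
import resource

# Inspect the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# An unprivileged process may raise its soft limit up to the hard
# limit; raising the hard limit itself requires elevated privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```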

They used two DNS providers kept in sync with OctoDNS.

Ryan Timken and Kiran Naidoo — Cloudflare

This is all about understanding the whole system (people and technology) and building learning, rather than finding a superficial “root cause”.

Piyush Verma — Last9

Outages

SRE Weekly Issue #246

A message from our sponsor, StackHawk:

Looking to get started with application security testing in CI/CD? Here is a broad overview of steps you can take.
https://sthwk.com/how-to-app-sec-in-ci

Articles

DNS-based load balancing is a nice simple solution, but unfortunately it doesn’t work well in certain circumstances. Read to find out how Algolia evolved their load balancing system in response.

Paul Berthaux — Algolia
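
For context, the simple version works roughly like this (a sketch of the general technique, not Algolia's code): publish several A records for one hostname and let each client pick one. The trouble is that resolvers cache the answer for the record's TTL, so health changes propagate slowly.

```python
import random
import socket

def pick_endpoint(hostname: str, port: int = 443) -> str:
    # getaddrinfo returns every address record published for the name;
    # naive DNS-based balancing has each client pick one at random.
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return random.choice([info[4][0] for info in infos])

# Caveat: resolvers cache this answer for the record's TTL, so an
# unhealthy server keeps receiving traffic until caches expire.
print(pick_endpoint("example.com"))
```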

We use percentiles all the time, so it’s really important to actually understand what they say (and what they don’t).

Piyush Verma — Last9
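
One trap worth internalizing (my illustration, not from the article): percentiles don't average. The fleet-wide p99 is not the mean of per-host p99s, so aggregate the raw samples (or a mergeable sketch) instead:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hosts with very different latency distributions (milliseconds).
host_a = rng.exponential(scale=10, size=10_000)
host_b = rng.exponential(scale=100, size=10_000)

# Wrong: averaging the per-host percentiles.
avg_of_p99s = np.mean([np.percentile(host_a, 99), np.percentile(host_b, 99)])

# Right: compute the percentile over the pooled raw samples.
true_p99 = np.percentile(np.concatenate([host_a, host_b]), 99)

print(f"average of p99s: {avg_of_p99s:.1f} ms")
print(f"fleet-wide p99:  {true_p99:.1f} ms")  # noticeably higher
```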

Thanks to an anonymous reader for this one.

The author started out as an embedded systems developer and moved into SRE. Here’s what they learned.

Eric Uriostigue — effx

Some great tips here. It’s hard to sound sincere in a public incident report, especially if you post a lot of them.

Adam Fowler

In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.

Sourabh Suman — Grab

This article covers Netflix’s gnmi-gateway, their open source tool for collecting metrics from network devices in a highly available and fault-tolerant manner.

Colin McIntosh and Michael Costello — Netflix

This year, re:Invent is online only, so you still have a chance to attend if you’re interested.

Ana M Medina — Gremlin

Cloudflare’s API service was impaired early this month. This is their incident report that describes a grey failure in a switch and downstream impact to etcd and their database system.

Tom Lianza and Chris Snook — Cloudflare

Outages

SRE Weekly Issue #245

A message from our sponsor, StackHawk:

Check out how we have built our microservices in Kubernetes here at StackHawk.
https://sthwk.com/kube-services

Articles

A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certificate authorities (CAs) when things go wrong.

Trust Asia

I like the idea that adding failover capability to a system makes it much more complicated, and thus more likely to fail.

Andre Newman — Gremlin

This one introduces some interesting concepts: the error kernel and property testing.

Kenneth Cross — HelloSign
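
If property testing is new to you, here's the flavor in a few lines of Python using the hypothesis library (my example, not HelloSign's code): rather than asserting on hand-picked inputs, you state invariants and let the framework search for counterexamples.

```python
from hypothesis import given, strategies as st

def dedupe(items):
    # Keep the first occurrence of each item, preserving order.
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

@given(st.lists(st.integers()))
def test_dedupe_properties(xs):
    result = dedupe(xs)
    # Property 1: no duplicates remain.
    assert len(result) == len(set(result))
    # Property 2: no elements are invented or lost.
    assert set(result) == set(xs)
    # Property 3: deduplication is idempotent.
    assert dedupe(result) == result
```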

[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.

Xavier Grand — Algolia

More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.

Zachary Flower — Splunk

Exactly-once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.

Paddy Byers — Ably
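
A common building block behind such schemes (a generic sketch, not Ably's actual protocol): the publisher attaches a unique ID to every message, and the consuming side discards IDs it has already processed, turning at-least-once delivery into effectively-once processing.

```python
import uuid

class IdempotentConsumer:
    """Turns at-least-once delivery into effectively-once processing
    by remembering which message IDs have already been handled."""

    def __init__(self):
        self.processed_ids = set()  # in production: a durable, bounded store

    def handle(self, message_id: str, payload: str) -> bool:
        if message_id in self.processed_ids:
            return False  # duplicate redelivery, safely ignored
        self.processed_ids.add(message_id)
        print(f"processing {payload!r}")
        return True

consumer = IdempotentConsumer()
msg_id = str(uuid.uuid4())
consumer.handle(msg_id, "charge card")  # processed
consumer.handle(msg_id, "charge card")  # retry after timeout: skipped
```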

The most effective (if scary) way to understand how your stateless service operates under load.

Utsav Shah — Software at Scale
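
In other words: load test it. Even a toy generator (stdlib only, purely illustrative) is enough to find where latency starts to bend:

```python
import concurrent.futures
import time
import urllib.request

def hit(url: str) -> float:
    # Time one request end to end.
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.monotonic() - start

def load_test(url: str, total: int = 1000, concurrency: int = 50) -> None:
    # Drive the service with concurrent requests and record latencies;
    # watching where p99 degrades reveals the real capacity ceiling.
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        latencies = sorted(pool.map(hit, [url] * total))
    print(f"p50={latencies[total // 2]:.3f}s  p99={latencies[int(total * 0.99)]:.3f}s")
```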

Some good tips here — and a reminder that we may see even more traffic than normal due to social distancing.

Outages

SRE Weekly Issue #244

A message from our sponsor, StackHawk:

Are you attending KubeCon this week? Be sure to swing by StackHawk’s virtual booth to get a t-shirt and be entered to win a Nintendo Switch.

Articles

If you’re gonna operate on a pile of computers whose count runs to six figures, all at once, making you type that number in is a way to make you pause and think about what you’re doing.

Rachel by the bay
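
The guard rail is simple to picture; here's a minimal sketch (hypothetical, not the tooling from the article):

```python
import sys

def confirm_fleet_operation(hosts: list[str]) -> None:
    # Force the operator to type the exact host count; for a six-figure
    # fleet, typing "104236" is a deliberate moment to stop and think.
    expected = str(len(hosts))
    answer = input(f"This will act on {expected} hosts. Type the count to proceed: ")
    if answer.strip() != expected:
        sys.exit("Count mismatch; aborting.")
```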

Find out why they decided to focus less on nines, and what they did instead.

Robert Sullivan

Reminds me of the classic:

It’s not DNS
There’s no way it’s DNS
It was DNS

— ssbroski on reddit

Mike S.

Their front-end made duplicate calls to the new API to test load and response time prior to cutting over.

Michael P. Geraci — OkCupid
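
This pattern is often called shadow traffic or dark launching. A rough sketch of the idea (my illustration; the endpoints are made up, and this is not OkCupid's code):

```python
import concurrent.futures
import urllib.request

executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)

# Hypothetical endpoints standing in for the old and new APIs.
OLD_API = "https://old-api.example.com/match"
NEW_API = "https://new-api.example.com/match"

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()

def handle_request(path: str) -> bytes:
    # Fire-and-forget duplicate call: exercises the new API with real
    # production load without affecting what the user sees.
    executor.submit(fetch, NEW_API + path)
    # The user's response still comes from the proven old API.
    return fetch(OLD_API + path)
```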

This is really cool. The researchers created a role-play scenario based on a real plane crash. They tried to get participants to blame “human error”, so that they could then surprise them with all of the (many) contributing factors that were involved.

Emily S. Patterson, Richard I. Cook, David D. Woods, Marta L. Render

Tips from one sysadmin’s journey to becoming an SRE.

Josh Duffney — Octopus Deploy

Outages

SRE Weekly Issue #243

A message from our sponsor, StackHawk:

The shift to rapid, frequent deployments over the past decade initially left application security behind. Modern AppSec belongs in the CI/CD pipeline.
http://sthwk.com/app-sec-pipeline

Articles

Sometimes I come across a simple but mind-blowingly awesome new idea. This is one of those times.

During periods of high load and errors, Netflix’s edge load balancer sends feedback to the apps running on users’ devices, adjusting their retry and backoff strategy to keep the service running as smoothly as possible but avoid a thundering herd. Brilliant.

Manuel Correa, Arthur Gonigberg, and Daniel West — Netflix
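
Client-side, the mechanics rhyme with ordinary exponential backoff plus jitter, except the parameters can be pushed from the server. A sketch under assumptions (the "retry-after-base" header and the response shape are invented for illustration; this is not Netflix's protocol):

```python
import random
import time

def backoff_delay(attempt: int, base: float, cap: float) -> float:
    # "Full jitter" exponential backoff: spreads retries out so clients
    # don't all come back at the same instant (no thundering herd).
    return random.uniform(0, min(cap, base * 2 ** attempt))

def retry_with_server_hints(call, max_attempts: int = 5):
    # `call` returns an object with .status and .headers (assumed shape).
    # base/cap start as client defaults; the hypothetical
    # "retry-after-base" header lets the edge push a new value
    # when it is under stress.
    base, cap = 0.1, 10.0
    for attempt in range(max_attempts):
        response = call()
        if response.status < 500:
            return response
        hint = response.headers.get("retry-after-base")
        if hint is not None:
            base = float(hint)
        time.sleep(backoff_delay(attempt, base, cap))
    return response
```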

I helped to invent new approaches to correlate telemetry signals (exemplars, correlation between tracing and logging, profiler labels) that helped our engineers to navigate latency problems faster.

Facebook has two very different kinds of users for live streaming: “normal” users and broadcasters streaming sporting events and the like.

Hemal Khatri, Alex Lambert, Jordi Cenzano and Rodrigo Broilo — Facebook

This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers use in varying combinations to debug effectively.

Charisma Chan and Beth Cooper — Google

The three patterns discussed in this paper are:

  • decompensation
  • working at cross purposes
  • getting stuck in outdated behaviors

David Woods and Matthieu Branlat

Outages

A production of Tinker Tinker Tinker, LLC