SRE Weekly Issue #246

A message from our sponsor, StackHawk:

Looking to get started with application security testing in CI/CD? Here is a broad overview of steps you can take.
https://sthwk.com/how-to-app-sec-in-ci

Articles

DNS-based load balancing is a nice simple solution, but unfortunately it doesn’t work well in certain circumstances. Read to find out how Algolia evolved their load balancing system in response.

Paul Berthaux — Algolia

We use percentiles all the time, so it’s really important to actually understand what they say (and what they don’t).

Piyush Verma — Last9

Thanks to An anonymous reader for this one.

The author started out as an embedded systems developer and moved into SRE. Here’s what they learned.

Eric Uriostigue — effx

Some great tips here. It’s hard to sound sincere in a public incident report, especially if you post a lot of them.

Adam Fowler

In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.

Sourabh Suman — Grab

This article covers Netflix’s gnmi-gateway, their open source tool for collecting metrics from network devices in a highly available and fault-tolerant manner.

Colin McIntosh and Michael Costello — Netflix

This year, re:Invent is online only, so you still have a chance to attend if you’re interested.

Ana M Medina — Gremlin

Cloudflare’s API service was impaired early this month. This is their incident report that describes a grey failure in a switch and downstream impact to etcd and their database system.

Tom Lianza and Chris Snook — Cloudflare

Outages

Updated: November 29, 2020 — 8:24 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme