SRE Weekly Issue #194

A message from our sponsor, VictorOps:

As DevOps and IT teams ingest more alerts and respond to more incidents, they collect more information and historical context. Today, teams are using this data to optimize incident response through constant automation and machine learning.

https://go.victorops.com/sreweekly-incident-response-automation-and-machine-learning

Articles

Last week, I mistakenly listed an outage as “Connectivity Issues”, when it should have been attributed to Squarespace. Sorry about that!

From the authors of the new Post-Incident Review Zine comes this summary of Niall Murphy’s SREcon talk. It’s way more than a talk recap, tying together related blog posts and talks from other authors.

Jaime Woo and Emil Stolarsky

They didn’t trust the datacenter’s backup power, so they added rack UPSes. Little did they realize that a single UPS failure could take down all of the rest.

Richard Speed — The Register

Taiji chooses which datacenter to route a Facebook user’s traffic to. It identifies clusters of users that have friended each other and routes them to the same place, on the theory that they’re likely to be interested in the same content.

Adrian Colyer (summary)

Xu et al., SOSP’19 (original paper)
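
To make the idea concrete, here is a toy sketch of "friends go to the same place". It is not Taiji's algorithm, and the paper's actual partitioning is not reproduced here. The sketch assumes the simplest possible stand-in: treat each connected group of friends as one cluster (union-find), then hash the cluster ID to pick a datacenter. The names `friend_clusters`, `route`, and the datacenter labels are hypothetical.

```python
import zlib

def friend_clusters(friendships):
    """Map each user to a cluster ID using union-find over friend pairs.

    Assumption: a cluster is a connected component of the friend graph.
    A production system would need balanced, size-limited partitions.
    """
    parent = {}

    def find(user):
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    for a, b in friendships:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller name as the root so cluster IDs are stable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {user: find(user) for user in parent}

def route(cluster_id, datacenters):
    """Pick a datacenter for a whole cluster with a deterministic hash."""
    index = zlib.crc32(cluster_id.encode("utf-8")) % len(datacenters)
    return datacenters[index]

# Hypothetical data for illustration only.
friendships = [("ann", "bob"), ("bob", "cy"), ("dee", "eve")]
datacenters = ["dc-east", "dc-west", "dc-north"]

clusters = friend_clusters(friendships)
placement = {user: route(cid, datacenters) for user, cid in clusters.items()}
```

Because routing is keyed on the cluster rather than the user, ann, bob, and cy always land in the same datacenter, which is the property the blurb describes.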

<3 detailed debugging stories. TIL: Google Compute Engine’s network drops connections from its state table after 10 minutes with no packets.

Stan Hu — GitLab
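
A common mitigation for idle-connection drops like this is TCP keepalive with an idle time below the timeout. The sketch below is one way to do that from Python, not the fix described in the article. The helper name `enable_keepalive` and the timing values (probe after 5 minutes idle, comfortably under the 10 minutes mentioned above) are assumptions. The `TCP_KEEP*` options are platform-dependent, so each one is set only if the local `socket` module exposes it.

```python
import socket

def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Turn on TCP keepalive so an idle connection still sends packets.

    idle_s:     seconds of silence before the first probe (assumed value)
    interval_s: seconds between probes (assumed value)
    probes:     unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = {
        "TCP_KEEPIDLE": idle_s,
        "TCP_KEEPINTVL": interval_s,
        "TCP_KEEPCNT": probes,
    }
    for name, value in tuning.items():
        option = getattr(socket, name, None)  # not defined on every OS
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock

# Configure the socket before connecting; no network traffic happens here.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
```

Where the per-socket options are unavailable, the operating system's global keepalive settings apply instead, and those defaults are often much longer than 10 minutes.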

Vortex is Dropbox’s custom-built metrics system, designed for horizontal scalability. Find out why they rolled their own and learn how it works in this article that includes shiny diagrams.

Dave Zbarsky — Dropbox

How do we come up with our SLOs, anyway? This one puts me in mind of Will Gallego’s post on error budgets.

Dean Wilson (@unixdaemon)

A network stack in userland as an alternative to TCP/IP? Yup, that seems like a pretty Google thing to do.

Adrian Colyer (summary)

Marty et al., SOSP’19 (original paper)

Outages

Updated: November 17, 2019 — 8:59 pm
A production of Tinker Tinker Tinker, LLC