SRE Weekly Issue #194

Articles

Last week, I mistakenly listed an outage as “Connectivity Issues”, when it should have been attributed to Squarespace. Sorry about that!

Sleep, Interrupted: Niall Richard Murphy on Taking the Emergency Out of On-Call

From the authors of the new Post-Incident Review Zine comes this summary of Niall Murphy’s SREcon talk. It’s way more than a talk recap, tying together related blog posts and talks from other authors.

Jaime Woo and Emil Stolarsky

The silence of the racks is deafening, production gear has gone dark – so which wire do we cut?

They didn’t trust the datacenter’s backup power, so they added rack UPSes. Little did they realize that a single UPS failure could take down all of the rest.

Richard Speed — The Register

Taiji: managing global user traffic for large-scale Internet services at the edge

Taiji chooses which datacenter to route a Facebook user’s traffic to. It identifies clusters of users that have friended each other and routes them to the same place, on the theory that they’re likely to be interested in the same content.

Adrian Colyer (summary)

Xu et al., SOSP’19 (original paper)

What Tracking Down Missing TCP Keepalives Taught Me About Docker, Golang, and GitLab

<3 detailed debugging stories. TIL: Google Compute Engine’s network drops connections from its state table after 10 minutes with no packets.

Stan Hu — GitLab
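
To make the keepalive point concrete, here is a minimal sketch (not taken from the linked article) of turning on TCP keepalives for a socket so that an idle connection still sends packets before a 10-minute idle timeout like the one described above. The helper name `enable_keepalive` and the default timings (first probe after 300 seconds, then every 60 seconds) are assumptions chosen only to stay under that window. The per-socket tuning options are platform-specific, so the sketch skips any that the local `socket` module does not expose.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Enable TCP keepalives on sock and return it.

    The defaults are illustrative assumptions: the first probe goes out
    after 5 idle minutes, well under a 10-minute idle timeout.
    """
    # SO_KEEPALIVE is the portable on/off switch.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These tuning knobs are not available on every platform, so only
    # set the ones this Python build actually exposes.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Without per-socket tuning, the operating system's default idle time applies (two hours on a stock Linux kernel), which is far longer than a 10-minute timeout.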

Monitoring server applications with Vortex

Vortex is Dropbox’s custom-built metrics system, designed for horizontal scalability. Find out why they rolled their own and learn how it works in this article that includes shiny diagrams.

Dave Zbarsky — Dropbox

Magic Numbers and second guessing SLOs – why is 96% better than 95%?

How do we come up with our SLOs, anyway? This one puts me in mind of Will Gallego’s post on error budgets.

Dean Wilson (@unixdaemon)
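
For a sense of what one percentage point means in practice, here is a small sketch (my own arithmetic, not from the linked post) that converts an availability SLO into allowed downtime. It assumes a simple time-based availability SLO over a 30-day window; the function name `allowed_downtime_hours` is hypothetical.

```python
def allowed_downtime_hours(slo_percent, window_days=30):
    """Hours of downtime an availability SLO permits over the window.

    Assumes a time-based availability SLO; the 30-day window is an
    illustrative default, not something the linked post prescribes.
    """
    window_hours = window_days * 24
    return window_hours * (100 - slo_percent) / 100


if __name__ == "__main__":
    for slo in (95, 96):
        print(f"{slo}% -> {allowed_downtime_hours(slo):.1f} hours per 30 days")
```

Under those assumptions, moving from 95% to 96% shrinks the budget from 36 hours to 28.8 hours per 30 days. Whether that difference matters to users is the question the post raises.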

Snap: a microkernel approach to host networking

A network stack in userland as an alternative to TCP/IP? Yup, that seems like a pretty Google thing to do.

Adrian Colyer (summary)

Marty et al., SOSP’19 (original paper)

Outages

Disney+ (streaming service)
- Disney’s new streaming service suffered a few hiccups due to unexpectedly high demand.
Codeanywhere
Google Nest
NFL Network (streaming service)
YouTube
Hulu
Heroku: followup for incident #1922
Heroku
TransferWise
Google Cloud Platform
- A problem with KMS impacted multiple services in several regions. Google’s detailed followup analysis is linked above.
