Articles
Last week, I mistakenly listed an outage as “Connectivity Issues”, when it should have been attributed to Squarespace. Sorry about that!
From the authors of the new Post-Incident Review Zine comes this summary of Niall Murphy’s SRECon talk. It’s way more than a talk recap, tying together related blog posts and talks from other authors.
Jaime Woo and Emil Stolarsky
They didn’t trust the datacenter’s backup power, so they added rack UPSes. Little did they realize that a single UPS failure could take down all of the rest.
Richard Speed — The Register
Taiji chooses which datacenter to route a Facebook user’s traffic to. It identifies clusters of users that have friended each other and routes them to the same place, on the theory that they’re likely to be interested in the same content.
Adrian Colyer (summary)
Xu et al., SOSP’19 (original paper)
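To make the idea concrete, here is a toy sketch of grouping users by friendship and sending each group to one datacenter. It is not Taiji's algorithm: the paper's actual partitioning and load balancing are not reproduced here. Plain connected components stand in for the clustering, and round-robin assignment stands in for the routing decision. The names `friend_clusters` and `route_clusters` are hypothetical.

```python
from collections import defaultdict


def friend_clusters(users, friendships):
    """Group users into connected components of the friendship graph.

    Assumption: every user named in `friendships` also appears in `users`.
    """
    parent = {u: u for u in users}

    def find(u):
        # Walk to the root, halving the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in friendships:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for u in users:
        clusters[find(u)].append(u)
    return list(clusters.values())


def route_clusters(clusters, datacenters):
    """Assign whole clusters to datacenters round-robin, largest first."""
    routing = {}
    ordered = sorted(clusters, key=len, reverse=True)
    for i, cluster in enumerate(ordered):
        dc = datacenters[i % len(datacenters)]
        for user in cluster:
            routing[user] = dc
    return routing
```

Friends always land in the same place because a cluster is never split. A real system would also have to weigh capacity and latency, which this sketch ignores.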
<3 detailed debugging stories. TIL: Google Compute Engine’s network drops connections from its state table after 10 minutes with no packets.
Stan Hu — GitHub
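One common way to keep a long-idle connection out of trouble with a state table like that is TCP keepalive, so the connection is never silent for the full 10 minutes. The sketch below is a general illustration, not the fix described in the article. The helper name `enable_keepalive` and the 300-second idle value are assumptions, and the per-socket tuning options are platform-specific, so they are applied only where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Enable TCP keepalive so an otherwise idle connection still sends packets.

    idle_s is an illustrative value chosen to stay well under a 10-minute
    idle timeout; tune it for your own environment.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These constants are not available on every platform, so set each
    # one only if the socket module defines it.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Where the per-socket options are missing, the operating system's default keepalive timing applies, and that default may be longer than 10 minutes.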
Vortex is Dropbox’s custom-built metrics system, designed for horizontal scalability. Find out why they rolled their own and learn how it works in this article that includes shiny diagrams.
Dave Zbarsky — Dropbox
How do we come up with our SLOs, anyway? This one puts me in mind of Will Gallego’s post on error budgets.
Dean Wilson (@unixdaemon)
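For readers who want the arithmetic behind an error budget, here is a minimal sketch, assuming a simple time-based availability SLO. The function names and the 30-day window are illustrative choices, not something taken from either post.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, bad_minutes, window_days=30):
    """Budget left after subtracting the minutes already spent on outages."""
    return error_budget_minutes(slo_target, window_days) - bad_minutes
```

A 99.9% target over 30 days works out to roughly 43.2 minutes of budget, so a single 30-minute outage would leave about 13 minutes for the rest of the window.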
A network stack in userland as an alternative to TCP/IP? Yup, that seems like a pretty Google thing to do.
Adrian Colyer (summary)
Marty et al., SOSP’19 (original paper)
Outages
- Disney+ (streaming service)
  - Disney’s new streaming service suffered a few hiccups due to unexpectedly high demand.
- Codeanywhere
- Google Nest
- NFL Network (streaming service)
- YouTube
- Hulu
- Heroku: followup for incident #1922
- Heroku
- TransferWise
- Google Cloud Platform
  - A problem with KMS impacted multiple services in several regions. Google’s detailed followup analysis is linked above.