SRE Weekly Issue #133

SPONSOR MESSAGE

A big part of SRE is outage preparation and confidence. See how a DevOps culture of collaboration and accountability can better prepare your SRE team for outages:

http://try.victorops.com/sreweekly/sre-outage-collaboration

Articles

My sincerest apologies to Ali Haider Zaveri, author of the article Location-Aware Distribution: Configuring servers at scale. I originally miscredited the article to two folks, claiming they were from Facebook when in fact they work at Google.

As Grubhub built out their service-oriented architecture, they first developed “base frameworks for building highly available, distributed services”.

William Blackie — Grubhub

Cloudflare discusses an optimization that improves their p99 response time in the face of occasionally slow disk access. Today I learned: Linux does not support non-blocking reads from regular files on disk.

Ka-Hing Cheung — Cloudflare

I include this article not just to warn you in case you depend on GeoTrust certificates, but also to highlight what’s involved in running a reliable and trustworthy CA.

Devon O’Brien, Ryan Sleevi, and Andrew Whalley — Google

Ably goes over the six key constraints that influenced their design and describes the solution they came up with. Some of the constraints seem to involve preserving not just the reliability of their own systems, but also that of their customers’ systems.

Simon Woolf — Ably

Given that we already knew in advance how to deal with each issue as it arose, it made sense to automate the work. Here’s how we did it.

James O’Keeffe — Google

In this article we will look at the various load balancing solutions available in Azure and which one should be used in which scenario.

Rahul Rajat Singh

Outages

Updated: August 5, 2018 — 9:24 pm
A production of Tinker Tinker Tinker, LLC