SRE Weekly Issue #202

Articles

Conveying confusion without confusing the reader

When writing about an incident, it’s important to skillfully show the reader how the participants’ understanding of the situation evolved.

Lorin Hochstein

The Morning Paper: Ironies of automation

This is a summary of Bainbridge’s seminal paper, and I really love where Adrian Colyer goes with it.

One example I found myself thinking about while reading through the paper does have a human precedence though: self-driving cars.

Adrian Colyer — The Morning Paper (summary)

Bainbridge — Automatica (original paper)

Sharing SQLite databases across containers is surprisingly brilliant

I have to admit, it is brilliant. Why add the risk (and latency) of a centralized configuration repository service when a local DB on each host will do?

Rick Branson — Segment

Managing Failure Modes in Microservice Architectures

This one covers a lot. My favorite parts:

- Permissive failure: if Netflix’s subscriber information service is down, they just show videos for free, favoring reliability over correctness.
- Human attention span: if it takes 10 minutes to see if your changes broke production, you’re likely to wander off and work on something else.
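
To make the permissive failure idea concrete, here is a minimal fail-open sketch. It is only an illustration of the pattern, not Netflix's implementation: `ServiceUnavailable`, `can_watch`, and the two lookup functions are hypothetical names invented for this example.

```python
class ServiceUnavailable(Exception):
    """Raised when the (hypothetical) subscriber service cannot be reached."""


def can_watch(user_id, lookup_subscription):
    """Return True if the user may stream.

    Fails open: if the subscriber lookup itself is down, allow playback
    rather than block every user on a dependency outage.
    """
    try:
        return lookup_subscription(user_id)
    except ServiceUnavailable:
        # Assumption: reliability is favored over correctness here.
        return True


def healthy_lookup(user_id):
    # Stand-in for a working subscriber service with one known subscriber.
    return user_id in {"alice"}


def broken_lookup(user_id):
    # Stand-in for an outage of the subscriber service.
    raise ServiceUnavailable("subscriber service is down")
```

With `healthy_lookup`, a non-subscriber is refused as usual; with `broken_lookup`, everyone is allowed through until the dependency recovers. Whether failing open is acceptable depends on what an incorrect "yes" costs, so this trade-off probably does not fit checks such as payments or access control.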

Adrian Cockcroft

Understanding Observability

The author guides you through the moment they began to truly understand what observability is all about. Worth reading even if you’re already quite familiar with the concept.

Sanjeev Sharma

Intelligent DNS based load balancing at Dropbox

This article describes our work with NS1 to optimize our intelligent DNS-based global load balancing for corner cases that we uncovered while improving our point of presence (PoP) selection automation for our edge network.

How We Prevented App Performance Degradation From Sudden Ride Demand Spikes

Grab uses bulkheading to prevent localized demand spikes from affecting the service for customers elsewhere. The notable part is that they shed load they can’t satisfy anyway, due to a limited supply of available vehicles.
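
As a rough illustration of bulkheading combined with load shedding, the sketch below caps in-flight requests per region and rejects anything over the cap. It is not Grab's code: the `RegionBulkhead` class, its method names, and the fixed per-region limit are assumptions made for this example, and the class is not thread-safe.

```python
class RegionBulkhead:
    """Caps in-flight requests per region so one region's spike stays local."""

    def __init__(self, limit_per_region):
        # Hypothetical fixed limit; a real system might derive it from supply.
        self.limit = limit_per_region
        self.in_flight = {}

    def try_acquire(self, region):
        """Admit the request if the region is under its limit, else shed it."""
        current = self.in_flight.get(region, 0)
        if current >= self.limit:
            return False  # shed load instead of queueing work that cannot be served
        self.in_flight[region] = current + 1
        return True

    def release(self, region):
        """Mark one admitted request for the region as finished."""
        if self.in_flight.get(region, 0) > 0:
            self.in_flight[region] -= 1


bulkhead = RegionBulkhead(limit_per_region=2)
admitted = [bulkhead.try_acquire("north") for _ in range(3)]
other_region_ok = bulkhead.try_acquire("south")
```

Here the third request for `"north"` is shed, while `"south"` is still admitted because each region has its own budget.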

Corey Scott — Grab

Outages

Dyn
- Dyn had a delay in DNS resolution in London.
Google Cloud Platform (update on December 18 outage)
- On Wednesday, 18 December, 2019, a part of Google’s production network experienced a temporary reduction in capacity, due to multiple fiber cuts in optical links interconnecting Sofia, Bulgaria with other points-of-presence.
Travelex
Twitter
Airbnb
Thingiverse
Southwest Airlines website
Yahoo Mail
Disney+
QuickBooks
Trello
Reddit
