SRE Weekly Issue #147

Articles

Honeycomb’s Charity Majors: Go Ahead, Test in Production

This is an excellent summary of a talk given last month on testing in production.

“Distributed systems are incredibly hostile to being cloned or imitated, or monitored or staged,” she said. “Trying to mirror your staging environment to production is a fool’s errand. Just give up.”

Joab Jackson — The New Stack

DBMS Musings: Distributed consistency at scale: Spanner vs. Calvin

This post weighs the pros and cons of Calvin and Spanner, two data stores described in papers published in 2012. According to the author, Calvin largely comes out as the favorite.

Daniel Abadi

RobinHood: tail latency aware caching – dynamic reallocation from cache-rich to cache-poor

What a cool concept!

RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy.

Adrian Colyer — The Morning Paper (summary)

Berger et al. (original paper)

Cross shard transactions at 10 million requests per second

With thousands(!) of MySQL shards, Dropbox needed a way to have transactions span multiple shards while maintaining consistency.

Daniel Tahara — Dropbox

Heatmaps Make Ops Better

This is an excellent introduction to heatmaps, with some hints on how to interpret a couple of common patterns.

Danyel Fisher — Honeycomb

How Automatic Root Cause Analysis Works

This is a neat idea. By modelling the relationships between the components in your infrastructure, you can figure out which one might be to blame when everything starts alerting at once. Note: this article is heavily geared toward Instana.

Steve Waterworth — Instana
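
To make the dependency-modelling idea concrete, here is a minimal sketch. It is not Instana's algorithm. It assumes a hand-written dependency map and a simple heuristic: an alerting component is a likely culprit only if none of the things it depends on are alerting too. The component names and the `likely_root_causes` helper are hypothetical.

```python
def likely_root_causes(depends_on, alerting):
    """Return alerting components whose own dependencies are all healthy.

    depends_on: dict mapping a component to the components it depends on
                (hypothetical, hand-maintained model of the infrastructure).
    alerting:   iterable of components that are currently alerting.
    """
    alerting = set(alerting)
    return sorted(
        component
        for component in alerting
        # If a dependency is also alerting, this component is probably
        # a victim of that failure rather than the cause.
        if not any(dep in alerting for dep in depends_on.get(component, ()))
    )


# Hypothetical infrastructure: web -> api -> (db, cache)
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}

# Everything above the database is alerting at once.
print(likely_root_causes(depends_on, {"web", "api", "db"}))  # ['db']
```

A real system would also need to handle dependency cycles, partial failures, and alert timing, all of which this sketch ignores.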

Getafix: How Facebook tools learn to fix bugs automatically

Automated bug fixing seems to be all the rage lately. I wonder, is it practical for companies that aren’t the size of Facebook or Google?

Johannes Bader, Satish Chandra, Eric Lippert, and Andrew Scott — Facebook

Outages

Slack in Europe
Netflix
Instagram
Microsoft’s Windows license activation service
- Microsoft has acknowledged a problem affecting its Windows license activation servers in multiple countries that has resulted in users being told their Windows 10 Pro and Enterprise installations are invalid.
Lloyds Bank
GPS ankle bracelets in Australia
- A violent parolee is on the run after part of the GPS tracking system broke down due to Telstra’s network issues on Friday and Saturday.
