SRE Weekly Issue #147

A message from our sponsor, VictorOps:

Alert fatigue creates confusion, causes undue stress on your team, and hurts the overall reliability of the services you build. See how you can mitigate alert fatigue and build more reliable systems while making people happier:

http://try.victorops.com/sreweekly/effects-of-incident-alert-fatigue

Articles

This is an excellent summary of a talk given last month on testing in production.

“Distributed systems are incredibly hostile to being cloned or imitated, or monitored or staged,” she said. “Trying to mirror your staging environment to production is a fool’s errand. Just give up.”

Joab Jackson — The New Stack

This one weighs the pros and cons of Calvin and Spanner, two data stores described in papers published in 2012. According to the author, Calvin largely comes out ahead.

Daniel Abadi

What a cool concept!

RobinHood brings SLO violations down to 0.3%, compared to 30% SLO violations under the next best policy.

Adrian Colyer — The Morning Paper (summary)

Berger et al. (original paper)

With thousands(!) of MySQL shards, Dropbox needed a way to have transactions span multiple shards while maintaining consistency.

Daniel Tahara — Dropbox

This is an excellent introduction to heatmaps, with some hints on how to interpret a couple of common patterns.

Danyel Fisher — Honeycomb

This is a neat idea. By modelling the relationships between the components in your infrastructure, you can figure out which one might be to blame when everything starts alerting at once. Note: this article is heavily geared toward Instana.

Steve Waterworth — Instana
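
To make the dependency-modelling idea concrete, here is a minimal sketch in Python. It is not Instana's method; it only assumes you already have a map of which component depends on which. The function name likely_culprits and the web/api/db/cache topology are hypothetical, made up for illustration. The heuristic: among everything that is alerting, suspect the components whose own direct dependencies are all quiet.

```python
from typing import Dict, List, Set


def likely_culprits(depends_on: Dict[str, List[str]], alerting: Set[str]) -> Set[str]:
    """Return the alerting components that have no alerting direct dependency.

    depends_on maps a component to the components it calls (hypothetical data).
    A component whose dependency is also alerting is treated as a probable
    victim rather than a cause, so it is filtered out.
    """
    culprits = set()
    for component in alerting:
        deps = depends_on.get(component, [])
        if not any(dep in alerting for dep in deps):
            culprits.add(component)
    return culprits


# Hypothetical topology: web -> api -> (db, cache)
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}

# Everything upstream of the database is paging at once.
alerting = {"web", "api", "db"}

print(sorted(likely_culprits(depends_on, alerting)))  # prints ['db']
```

This only looks at direct dependencies and ignores timing, so treat the result as a starting point for investigation rather than a verdict.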

Automated bug fixing seems to be all the rage lately. I wonder, is it practical for companies that aren’t the size of Facebook or Google?

Johannes Bader, Satish Chandra, Eric Lippert, and Andrew Scott — Facebook

Outages

Updated: November 11, 2018 — 7:51 pm
A production of Tinker Tinker Tinker, LLC