SRE Weekly Issue #71

Articles

How we Upgraded a 22TB MySQL Cluster from 5.6 to 5.7 (in 9 months)

The interesting bit in this story is that upgrading to 5.7 requires a full table rewrite (ALTER TABLE) for any table that has time-related columns. Their initial test run took months and still hadn’t finished.

Migrating from Heroku to AWS: Our Story

AdStage made the move from Heroku to running their service directly on EC2, and in this article they explain why and how.

We were officially only getting about 2 ECUs per dyno, but the reality was that we were getting something closer to 6 since our neighbors on Heroku were not using their full share. This meant that our fleet of AWS instances was 3 times too small, […]

Language Warning: contains the word “sexy” used to describe new or interesting technology.

Full disclosure: Heroku, my employer, is mentioned.

The Discipline of Chaos Engineering

I’ve featured many articles from Mathias Lafeldt as part of his series, Production Ready. Now that he’s moved to Gremlin Inc (a SaaS helping customers run chaos experiments), Mathias reintroduces the history and theory of Chaos Engineering.

Homegrown master-master replication for a NoSQL database

The folks behind Mail.ru implemented their own master-master replication system on top of Tarantool, a DBMS I’d never heard of. Their implementation is based on some details of their use-case that may not apply more broadly, but the design discussion is interesting nonetheless.

OnlineSchemaChange rebuilt in Python

Facebook rewrote their tool, OnlineSchemaChange, in Python (from the original PHP). OSC is a tool for doing DDL in MySQL without downtime.

The original open sourced OSC was more like an engine than a tool. Users needed to write PHP code wrapping to run the schema change, and, with PHP becoming less popular in the operations world, OSC.php wasn’t widely adopted by the community.

After the Disaster: How to Learn from Historical Incident Management Data

From PagerDuty, an article on the incident management data to gather, how to gather it, and how to analyze it.

What is Structured Logging?

A basic introduction to structured logging, including rationale on why you’d want to use it. With infrastructures growing more and more complicated, I find structured logging indispensable in keeping everything up and running and debugging difficult problems.
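
A minimal sketch of the idea, not taken from the linked article: each log event is emitted as one JSON object per line, so fields like a request ID can be filtered and aggregated by tooling instead of being parsed out of free text. The logger name `checkout`, the `fields` key passed through `extra`, and the sample field names are all assumptions made for illustration; only Python's standard `logging` and `json` modules are used.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}}; this key is a
        # convention invented for this sketch, not a logging-module feature.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_sample(stream) -> None:
    """Emit one example event with two structured fields."""
    logger = make_logger(stream)
    logger.info(
        "payment failed",
        extra={"fields": {"request_id": "abc123", "latency_ms": 412}},
    )
```

Calling `log_sample(sys.stdout)` prints a single JSON line containing the message alongside `request_id` and `latency_ms`, which a log pipeline can then index by field.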

Building Express Backbone: Facebook’s new long-haul network

For the network nerds, Facebook details their new inter-datacenter network topology.

Introducing Machine Learning for the Elastic Stack

New in the latest version of the Elastic Stack (think Elasticsearch, Logstash, Kibana, etc.) is built-in anomaly detection using machine learning, based on technology from Prelert (acquired by Elastic in 2016). “Machine Learning”: they might as well say it’s powered by “Lasers™”. If you try this out and have any success, please write up your results and send me a link!

Outages

WhatsApp
Churchill Downs
Mumsnet
Cloudflare Status – Network Performance Issues in multiple locations
Telia
- Telia, a major internet backbone provider, deployed a misconfiguration that caused routing issues across the globe. Cloudflare noticed, as did Pingdom and Discord. Think back to almost a year ago, and you may remember that this isn’t the first time that they’ve caused this kind of far-reaching problem.
