SRE Weekly Issue #71


Resolving DevOps and IT incidents is not enough. Download the eBook: “Blameless Post Mortems (and how to do them)”, and start learning from them.


The interesting bit in this story is that upgrading to 5.7 requires a full table rewrite (<tt>ALTER TABLE</tt>) for any table that has time-related columns. Their initial test-run took months and still hadn’t finished.

AdStage made the move from Heroku to running their service directly on EC2, and in this article they explain why and how.

We were officially only getting about 2 ECUs per dyno, but the reality was that we were getting something closer to 6 since our neighbors on Heroku were not using their full share. This meant that our fleet of AWS instances was 3 times too small, […]

Language Warning: contains the word “sexy” used to describe new or interesting technology.

Full disclosure: Heroku, my employer, is mentioned.

I’ve featured many articles from Mathias Lafeldt as part of his series, Production Ready. Now that he’s moved to Gremlin Inc (a SaaS helping customers run chaos experiments), Mathias reintroduces the history and theory of Chaos Engineering.

The folks behind implemented their own master-master replication system on top of Tarantool, a DBMS I’d never heard of. Their implementation is based on some details of their use-case that may not apply more broadly, but the design discussion is interesting nonetheless.

Facebook rewrote their tool, OnlineSchemaChange in Python (from the original PHP). OSC is a tool for doing DDL in MySQL without downtime.

The original open sourced OSC was more like an engine than a tool. Users needed to write PHP code wrapping to run the schema change, and, with PHP becoming less popular in the operations world, OSC.php wasn’t widely adopted by the community.

From PagerDuty, an article on the incident management data to gather, how to gather it, and how to analyze it.

A basic introduction to structured logging, including rationale on why you’d want to use it. With infrastructures growing more and more complicated, I find structured logging indispensable in keeping everything up and running and debugging difficult problems.

For the network nerds, Facebook details their new inter-datacenter network topology.

New in the latest version of Elastic Stack (think ElasticSearch, Logstash, Kibana, etc) is built-in anomaly detection using machine learning, based on technology from Prelert (acquired by Elastic in 2016). “Machine Learning” — they might as well say it’s powered by “Lasers™”. If you try this out and have any success, please write up your results and send me a link!


Updated: May 7, 2017 — 10:26 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme