Winston is Netflix’s tool for runbook automation, based on the open-source StackStorm. Winston helps reduce pager burden by filtering out false-positive alerts, collecting information for human responders, and remediating some issues automatically.

Is it valid for those working on non-life-critical systems to try to draw on lessons learned in safety-critical fields like surgery and air traffic control? John Allspaw, citing Dr. Richard Cook, answers with an emphatic yes.

The best HA infrastructure design in the world won’t save you when your credit card on file expires.

There’s a huge amount of detail on both PostgreSQL and MySQL in this article, including some sneaky edge-case pitfalls that prompted Uber to look for a new database.

This article goes into a good amount of depth on setting up a Cassandra cluster to survive a full AZ outage.

When emergency services in a Maryland, US county went offline for two hours, 100 calls were missed, possibly contributing to two deaths. In the vein of last week’s theme of complex failures:

“This is really complex, and a lot of dominoes fell in a way that people just didn’t expect,” said Marc Elrich, chairman of the Public Safety Committee.

Here’s the third (final?) installment in this series. This one has some fascinating details on a topic near and dear to my heart: live migration of a database. Their use of DRBD and synchronous replication is especially intriguing.

Ooh, this is gonna be fun. Catchpoint and O’Reilly are hosting an AMA (Ask Me Anything) with DevOps and SRE folks, including Liz Fong-Jones and Charity Majors, both of whose articles have been featured here previously. The questions posted so far look pretty great.


