SRE Weekly Issue #34

Articles

Netflix: Introducing Winston – Event driven Diagnostic and Remediation Platform

Winston is Netflix’s tool for runbook automation, based on the open source StackStorm. Winston helps reduce pager burden by filtering out false-positive alerts, collecting information for human responders, and remediating some issues automatically.

High Tempo, High Consequence

Is it valid for those working on non-life-critical systems to try to draw on lessons learned in safety-critical fields like surgery and air traffic control? John Allspaw, citing Dr. Richard Cook, answers with an emphatic yes.

Expired credit cards can shut down businesses

The best HA infrastructure design in the world won’t save you when your credit card on file expires.

Why Uber Engineering Switched from Postgres to MySQL

There’s a huge amount of detail on both PostgreSQL and MySQL in this article, including some sneaky edge-case pitfalls that prompted Uber to look for a new database.

How to Setup a Highly Available Multi-AZ Cassandra Cluster on AWS EC2

This article goes into a good amount of depth on setting up a Cassandra cluster to survive a full AZ outage.

Officials try to explain 911 outage that should ‘never happen’

When a Maryland, US county’s emergency services went offline for two hours, 100 calls were missed, possibly contributing to two deaths. In the vein of last week’s theme of complex failures:

“This is really complex, and a lot of dominoes fell in a way that people just didn’t expect,” said Marc Elrich, chairman of the Public Safety Committee.

Netflix Billing Migration to AWS – Part III

Here’s the third (final?) installment in this series. This one has some fascinating details on a topic near and dear to my heart: live migration of a database. Their use of DRBD and synchronous replication is especially intriguing.

AMA DevOps & SRE

Ooh, this is gonna be fun. Catchpoint and O’Reilly are hosting an AMA (Ask Me Anything) with DevOps and SRE folks, including Liz Fong-Jones and Charity Majors, both of whose articles have been featured here previously. The questions posted so far look pretty great.

Outages

EE (UK telecom)
- EE users saw a 2-day outage when roaming.
Vocus (AU telecom)
PlayStation Network
Battle.net
123-Reg (UK web host)
Airtel
Commonwealth Bank
OGS (Online Go Server)
Neotel
