SRE Weekly Issue #172

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.


An experienced pilot and programmer details the background behind the 737 MAX’s MCAS and discusses the risks and motivations involved.

Boeing’s solution to its hardware problem was software.

Thanks to John Goerzen for this one.

Gregory Travis — IEEE Spectrum

A detailed analysis of a paper by Eric Hollnagel and David Woods on designing systems that include humans and computers.

The operator detects failures better when he participates in system control as opposed to functioning only as a monitor…

Thai Wood (summary)

An essay on the difference in philosophies between Safety I and Safety II and on understanding how our systems succeed rather than focusing on how they fail.

Ryan Frantz

Azure’s Project Tardigrade is exploring interesting ideas like keeping VMs resident in memory even when the host kernel reboots. This reminds me of another similarly named project.

Chris Kanaracus — TechTarget

This is a follow-up to an article from last week about a Honeycomb incident, going into more detail on what went wrong and how they figured it out using Honeycomb itself.

Douglas Soo — Honeycomb

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement.


This blog post describes Grab’s post-mortem findings for the outage caused by the Redis Cluster failure.

Michael Cartmell, Jiahao Huang, and Sandeep Kumar — Grab

I like how their chosen solution fetches from all the datacenters in the normal case, so they don’t experience a sudden shift in traffic pattern during a failover.

Preslav Le — Dropbox


Updated: May 12, 2019 — 9:13 pm