SRE Weekly Issue #170

A message from our sponsor, VictorOps:

Our latest list of the top 12 server monitoring tools can help your SRE team get started building a comprehensive monitoring strategy. Drive deeper service reliability through effective server monitoring:

http://try.victorops.com/sreweekly/top-server-monitoring-software

Articles

This myth is a misguided belief that engineers are like Laplace’s Demon; they maintain an accurate mental model of the system, foresee all the consequences of their actions, predict where the business is going, and are careful enough to avoid mistakes.

Aaron Blohowiak — Netflix

I highly recommend watching some of the talks, or at least perusing the slides.

Some incidents have been investigated by parties that were involved in or related to the incident, raising concerns about conflicts of interest. In a small company, avoiding this may not be possible, but we should at least keep the risks in mind.

Patrick Kingsland — Railway Technology

An absolute treasure trove of links to articles and papers on resilience engineering. Beyond the links, there are short profiles of 30+ important thinkers in the field. I'm going to be busy for a while.

@lorin (GitHub)

This is about project retrospectives, but it applies equally well to incident retrospectives.

Dominika Bula — Red Hat

Here’s a counterpoint to an article I linked to last week.

Karl Bode — Motherboard

Outages

Updated: April 28, 2019 — 8:17 pm
A production of Tinker Tinker Tinker, LLC