SRE Weekly Issue #200

A message from our sponsor, VictorOps:

Learn how to modernize your approach to incident management and slash MTTA/MTTR in the latest webinar from VictorOps + Splunk:

https://go.victorops.com/sreweekly-modernized-incident-management

Articles

The logical argument goes like this: if incidents in your system each had a single root cause, that implies a level of brittleness that would preclude your company running successfully at all.

Lorin Hochstein

Once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Lorin Hochstein

Confirmation bias can lead us to reinforce an incorrect mental model through spurious correlations.

Thai Wood — Resilience Roundup (summary)
Denis Besnard, David Greathead, and Gordon Baxter — International Journal of Human-Computer Studies (original paper)

In this post, I’ll recap [Adam Westman’s] talk, sharing the journey that led [Target] to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.

Anna MacLachlan — Sensu (recap)
Adam Westman — Target (talk)

Verbose debug logging + feature flagging = a way to investigate unknown unknowns in your system.

Will Sargent
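
A rough illustration of the idea in the blurb above, not taken from the linked post: debug statements stay in the code, and a feature flag decides at runtime whose debug output is actually emitted. The flag store (`DEBUG_FLAGS`), the `user_id` field, and the `handle_request` function are all hypothetical names chosen for this sketch; a real setup would ask a feature-flag service instead of a hard-coded set.

```python
import logging

# Hypothetical flag store: users for whom verbose logging is switched on.
# In practice this lookup would go to a feature-flag service.
DEBUG_FLAGS = {"user-42"}


def debug_enabled(user_id):
    """Return True when the debug-logging flag is on for this user."""
    return user_id in DEBUG_FLAGS


class FlagFilter(logging.Filter):
    """Let DEBUG records through only when the flag is on for their user."""

    def filter(self, record):
        # Make sure the attribute exists so the formatter below never fails.
        record.user_id = getattr(record, "user_id", None)
        if record.levelno > logging.DEBUG:
            return True  # INFO and above are always emitted
        return debug_enabled(record.user_id)


def build_logger(name="checkout"):
    """Create a logger whose handler applies the flag-based filter."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # accept everything; the filter decides
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.addFilter(FlagFilter())
        handler.setFormatter(
            logging.Formatter("%(levelname)s user=%(user_id)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def handle_request(logger, user_id):
    """Hypothetical request handler with always-present debug statements."""
    logger.debug("cart contents loaded", extra={"user_id": user_id})
    logger.info("request handled", extra={"user_id": user_id})


if __name__ == "__main__":
    log = build_logger()
    handle_request(log, "user-42")  # flag on: DEBUG and INFO lines
    handle_request(log, "user-7")   # flag off: INFO line only
```

Filtering at the handler keeps the call sites unchanged, so turning the flag on for one more user needs no deploy, only a flag change.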

Outages

Updated: December 29, 2019 — 9:42 pm
A production of Tinker Tinker Tinker, LLC