SRE Weekly Issue #200

Articles

The logical argument goes like this: if incidents in your system each had a single root cause, that implies a level of brittleness that would preclude your company running successfully at all.

Lorin Hochstein

A conjecture on why reliable systems fail

Once a system reaches a certain level of reliability, most major incidents will involve:

A manual intervention that was intended to mitigate a minor incident, or

Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Lorin Hochstein

When mental models go wrong. Co-occurrences in dynamic, critical systems

Confirmation bias can lead us to reinforce an incorrect mental model through spurious correlations.

Thai Wood — Resilience Roundup (summary)
Denis Besnard, David Greathead, and Gordon Baxter — International Journal of Human-Computer Studies (original paper)

Reducing alert fatigue with GoAlert, Target’s on-call scheduling and notification platform

In this post, I’ll recap [Adam Westman’s] talk, sharing the journey that led them to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.

Anna MacLachlan — Sensu (recap)
Adam Westman — Target (talk)

Targeted Diagnostic Logging in Production

Verbose debug logging + feature flagging = a way to investigate unknown unknowns in your system.

Will Sargent
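
To make the "debug logging + feature flagging" idea concrete, here is a minimal sketch in Python. It is not taken from the linked article. The flag store (`DEBUG_FLAGGED_USERS`), the `user_id` field, and the helper names are all hypothetical, and a real system would likely consult a feature-flag service rather than an in-memory set. The logger always runs at DEBUG, and a filter drops debug records unless the request belongs to a flagged user.

```python
import logging

# Hypothetical in-memory flag store (assumption); a real deployment
# would typically ask a feature-flag service instead.
DEBUG_FLAGGED_USERS = {"user-42"}


class TargetedDebugFilter(logging.Filter):
    """Let DEBUG records through only for flagged users."""

    def filter(self, record):
        # Records above DEBUG (INFO, WARNING, ...) always pass.
        if record.levelno > logging.DEBUG:
            return True
        # DEBUG records pass only if their user_id is flagged.
        return getattr(record, "user_id", None) in DEBUG_FLAGGED_USERS


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the sketch is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def build_logger(name="checkout"):
    """Return a (logger, handler) pair with the targeted filter attached."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # emit everything; the filter decides
    logger.propagate = False
    handler = ListHandler()
    handler.addFilter(TargetedDebugFilter())
    logger.handlers = [handler]
    return logger, handler


def handle_request(logger, user_id):
    """Hypothetical request handler that logs at two levels."""
    extra = {"user_id": user_id}
    logger.debug("cart contents loaded for %s", user_id, extra=extra)
    logger.info("request served for %s", user_id, extra=extra)
```

Calling `handle_request(logger, "user-1")` records only the info line, while `handle_request(logger, "user-42")` records both lines, so verbose output is limited to the traffic under investigation.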
