The logical argument goes like this: if each incident in your system had a single root cause, that would imply a level of brittleness that would preclude your company from running successfully at all.
Once a system reaches a certain level of reliability, most major incidents will involve:
- A manual intervention that was intended to mitigate a minor incident, or
- Unexpected behavior of a subsystem whose primary purpose was to improve reliability
Confirmation bias can lead us to reinforce an incorrect mental model through spurious correlations.
Thai Wood — Resilience Roundup (summary)
Denis Besnard, David Greathead, and Gordon Baxter — International Journal of Human-Computer Studies (original paper)
In this post, I’ll recap his talk, sharing the journey that led them to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.
Anna MacLachlan — Sensu (recap)
Adam Westman — Target (talk)
Verbose debug logging + feature flagging = a way to investigate unknown unknowns in your system.
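To make the combination concrete, here is a minimal sketch of debug logging gated by a feature flag, so that verbose output can be switched on for one user without redeploying. Everything in it is an assumption for illustration and not taken from the linked article: the in-memory `FLAGS` store, the `verbose-debug` flag name, the `flag_enabled` helper, and the `handle_request` handler are all hypothetical, and a real system would query its own flag service.

```python
import logging

# Hypothetical in-memory flag store: flag name -> set of user IDs it is on for.
# A real deployment would ask a feature-flag service instead.
FLAGS: dict[str, set[str]] = {"verbose-debug": {"user-42"}}


def flag_enabled(flag: str, user_id: str) -> bool:
    """Return True if the flag is switched on for this user."""
    return user_id in FLAGS.get(flag, set())


class ListHandler(logging.Handler):
    """Collects log lines in memory so the example is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(f"{record.levelname}:{record.getMessage()}")


logger = logging.getLogger("checkout")
logger.setLevel(logging.DEBUG)  # let every level through; the flag decides below
logger.propagate = False        # keep output out of the root logger
handler = ListHandler()
logger.addHandler(handler)


def handle_request(user_id: str, cart: list[float]) -> float:
    """Hypothetical handler: totals a cart, logging details only when flagged."""
    verbose = flag_enabled("verbose-debug", user_id)
    if verbose:
        logger.debug("cart for %s: %r", user_id, cart)
    total = sum(cart)
    if verbose:
        logger.debug("computed total %.2f", total)
    logger.info("request handled for %s", user_id)
    return total
```

Calling `handle_request("user-1", [1.0, 2.0])` records only the info line, while the same call for `user-42` also records the two debug lines. Adding a user ID to the flag at runtime is what lets you dig into behavior you did not anticipate without paying the logging cost for every request.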