Articles
Remember all those Robinhood outages? FINRA, a US financial industry regulator, is making Robinhood repay folks for the losses they sustained as a result, and is also fining the company for other reasons.
Michelle Ong, Ray Pellecchia, Angelita Plemmer Williams, and Andrew DeSouza — FINRA
This is brilliant and I wish I’d thought of it years ago:
One of the things we’ve previously seen during database incidents is that a set of impacted tables can provide a unique fingerprint to identify a feature that’s triggering issues.
Courtney Wang — Reddit
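To make the fingerprint idea concrete, here is a minimal sketch of one way it could work. It is not Reddit's implementation: the feature names, table names, and the choice of Jaccard similarity for scoring are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical mapping from feature name to the tables it touches.
# These names are invented for illustration, not taken from any real schema.
FEATURE_TABLES: Dict[str, Set[str]] = {
    "comments": {"comment", "comment_vote", "user"},
    "messaging": {"message", "inbox", "user"},
    "awards": {"award", "award_grant", "comment"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(
    impacted: Set[str], fingerprints: Dict[str, Set[str]]
) -> Optional[Tuple[str, float]]:
    """Return the feature whose table set best matches the impacted tables.

    Returns None when no feature shares any table with the impacted set.
    """
    scored = [(name, jaccard(impacted, tables)) for name, tables in fingerprints.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return max(scored, key=lambda pair: pair[1]) if scored else None


if __name__ == "__main__":
    # Tables seen misbehaving during a (made-up) incident.
    print(best_match({"message", "inbox"}, FEATURE_TABLES))
```

In practice the impacted set would come from whatever the database exposes during an incident, such as slow query logs or lock contention reports, and a low best score would suggest the fingerprint catalog is missing a feature.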
The suggested root cause involves consolidation in cloud providers and the importance of DNS.
Alban Kwan — CircleID
Full disclosure: Fastly, my employer, is mentioned.
This paper is about recognizing normalization of deviance and techniques for dealing with it. This tidbit really made me think:
[…] they might have been taught a system deviation without realizing that it was so […]
Bus Horiz
Blameless incident analysis is often at odds with a desire to “hold people accountable”. This article explores that conflict and techniques for managing the needs involved.
Christina Tan and Emily Arnott — Blameless
What can you do if you’re out of error budget but you still want to deliver new features? Get creative.
Paul Osman — Honeycomb
I am going to go through the variation we use to upskill our on-call engineers, which we called “The Kobayashi Maru”, a name we borrowed from the Star Trek training exercise that tests the character of Starfleet cadets.
Bruce Dominguez