Another case of “pilot error” vs “systemic problems”. It’s interesting to me how the organizational pressures the pilots were facing mirror many stories I’ve seen in tech firms, especially startups.
This article recommends improving MTTA (mean time to assemble) by modeling our dispatch systems on the emergency services for a large city.
Lots of great stuff to aspire to, with a big emphasis on observability.
Adriana Villela and Ana Margarita Medina — The New Stack
Full disclosure: Honeycomb, my employer, is mentioned.
I really love the concept of “incident legalism” introduced in this article. I’ve definitely been there.
Anyone who has coordinated an incident over Slack has felt the pain of ambiguous messages.
But communicating with specificity has a cost.
I remember this one! I was trying to listen to music at the time. Turns out it was DNS (and a git repo).
Erik Lindblad — Spotify
If you’re gonna group your incidents, use tags, not exclusive groups.
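To make the tags-versus-exclusive-groups point concrete, here is a minimal sketch in Python. The incident IDs, tag names, and the `incidents_by_tag` helper are all hypothetical, invented for illustration and not taken from the linked article. The idea is that each incident carries a set of tags, so it can show up under every category it touches instead of being forced into exactly one bucket.

```python
from collections import defaultdict

# Hypothetical incident records: the IDs and tag names are made up for illustration.
# Each incident maps to a *set* of tags rather than a single exclusive group.
incidents = {
    "INC-1": {"dns", "deploy"},
    "INC-2": {"dns"},
    "INC-3": {"capacity", "deploy"},
}


def incidents_by_tag(tagged):
    """Index incidents by tag; an incident appears under every tag it carries."""
    index = defaultdict(set)
    for incident_id, tags in tagged.items():
        for tag in tags:
            index[tag].add(incident_id)
    return dict(index)


by_tag = incidents_by_tag(incidents)
print(sorted(by_tag["dns"]))     # ['INC-1', 'INC-2']
print(sorted(by_tag["deploy"]))  # ['INC-1', 'INC-3']
```

With exclusive groups, INC-1 would have to be filed as either a DNS incident or a deploy incident, and whichever query you ran for the other category would miss it.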