If we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems.
Will Gallego says that we need to prioritize and incentivize learning from near misses, not just actual incidents.
We’ve made headway in putting energy into learning from incidents. We’ll be even better off when learning from successes becomes part of our regular work as well.
This 1977 air crash taught us many important lessons, including surprising details about the behavior of jet engines in rain. The water ingestion testing apparatus shown in one of the photos is pretty impressive.
When your alerts cover systems owned by different teams, who should be on call?
Nathan Lincoln — Honeycomb
Full disclosure: Honeycomb is my employer.
Cloudflare does some pretty eye-opening things with the network stack and file descriptors, as described in this amusingly named article.
Quang Luong and Chris Branch
While ostensibly about security incident response, this article has a lot of useful ideas for improving response to any kind of incident.
Jess Chang — Vanta (for incident.io)
An argument for monoliths over microservices, but with an important caveat: be careful about compartmentalizing your failure domains.
Lawrence Jones — incident.io
Here’s a great summary of the key themes from last month’s SREcon Americas.
Paige Cruz — Chronosphere