Last week, I came across Lorin Hochstein and started to read through his blog. Lorin has a lot of awesome stuff to say, as you can see in this issue. Thanks, Lorin!
“in aviation safety, it’s like we’ve been trying to learn about marriage by only studying divorce.”
Kristy Kiernan — Forbes
Use the right tool for the job, not the coolest one.
In line with last week’s article on patience by Will Gallego, this one emphasizes the importance of continued learning about resilience engineering.
Here are some really thought-provoking tips on how (and why) to write an effective post-incident analysis.
To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions.
Some great observations and questions related to the Cloudflare outage in July.
Sometimes, things are off, and you just know an incident is brewing. What is this skill, and how can we learn it?
Silvia Botros — Learning From Incidents
- Google Drive
- Slack System Status
- Hosted Chef
- Azure CDN and Azure Kubernetes Service
- Fastly (full disclosure: Fastly is my employer)