Last week, I came across Lorin Hochstein and started to read through his blog. Lorin has a lot of awesome stuff to say, as you can see in this issue. Thanks, Lorin!
Articles
“in aviation safety, it’s like we’ve been trying to learn about marriage by only studying divorce.”
Kristy Kiernan — Forbes
Use the right tool for the job, not the coolest one.
Mattias Geniar
In line with last week’s article on patience by Will Gallego, this one emphasizes the importance of continued learning about resilience engineering.
Lorin Hochstein
Here are some really thought-provoking tips on how (and why) to write an effective post-incident analysis.
Lorin Hochstein
To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions.
Lorin Hochstein
Some great observations and questions related to the Cloudflare outage in July.
Lorin Hochstein
Sometimes, things are off, and you just know an incident is brewing. What is this skill, and how can we learn it?
Silvia Botros — Learning From Incidents
Outages
- Google Drive
- Slack System Status
- Some of our customers are seeing an ‘Error: 9DCE38C9695E’ message after attempting to send a message in Slack.
The plot thickens.
Oh, and Slack had another incident too.
- Hosted Chef
- Azure CDN and Azure Kubernetes Service
- GitHub
- TikTok
- Spotify
- Discord
- Fastly (full disclosure: Fastly is my employer)