Root Cause Analysis is a flawed concept, and seeking it almost inevitably results in treating people unfairly. I like the concept of “Least Effort to Remediate” introduced in this article.
Casey Rosenthal — Verica
Slack developed a load simulation tool and used it to verify a new feature, Enterprise Key Management.
Serry Park, Arka Ganguli, and Joe Smith
After reviewing the history of the term “antifragility”, this article explains why it is a flawed concept and contrasts it with Chaos Engineering.
This is where the concept of antifragility veers from a truism into bad advice.
A routine data migration locked the primary database, causing all inbound requests to time out.
- Heroku: Followup for Incident #1821
  - A routine update caused unexpected downtime.
- Google Cloud Platform networking
- Hosted Graphite
- US Customs and Border Protection
- London Stock Exchange
- Stack Exchange Outage Postmortem
  - Here’s a followup for the Stack Exchange outage reported here previously.