In this talk, Dr. Richard Cook presents bone as the archetype for resilient systems, and shows us what we can learn about resilience engineering from medicine.
Richard Cook, MD — Adaptive Capacity Labs
This one has some interesting ideas on testing in production, including developer instances that live right inside production and take a portion of production traffic.
Keep in mind, though, that you aren’t really studying an incident at all: you’re studying your system through the lens of an incident.
This thread has an interesting analogy between alerts and code comments.
I’m really loving this thing where Adrian Colyer is going through classic works on The Morning Paper. Here’s his take on the STELLA Report.
Adrian Colyer — The Morning Paper (summary)
Woods et al. (original report)
- Full disclosure: Fastly is my employer.
- Google Drive
- There was an issue with one of the root DNS servers. More detail later in the thread.
- Apple App Store
- Honeywell Home
- Microsoft Teams
- At least, I think they had an outage. Their status site was down when I tried to verify this one.
- [Official update] Twitter down: App not working & broken for many users
- Hosted Graphite
- AWS ap-southeast-2