I skipped last week’s issue to set up a new swing set for my kids (gotta give ’em something to do while they’re stuck at home). I’m still a bit behind on articles, and I’ll catch up over the next couple of weeks.
Articles
The “messy” details of our human/computer systems are their hidden strength.
Lorin Hochstein
In this accident report, learn how two pilots lost situational awareness, with disastrous consequences.
Air Safety Institute
Without a structured strategy, and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, a complete failure.
Danny Mican — Squadcast
The cloud’s multiple availability zones and regions can be powerful, but it’s hard to get a multi-region architecture correct.
Serhat Can — OpsGenie
A useful little JavaScript tool: plug in an availability percentage (e.g. 99.99%), and get back the number of minutes you can be down in a day, month, quarter, or year.
Hexadecimal
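For anyone curious about the arithmetic behind a calculator like this, here is a minimal sketch in Python. It is not the linked tool's code. The function name `allowed_downtime_minutes` and the period lengths (a 30-day month, 90-day quarter, and 365-day year) are assumptions made for illustration, so the linked tool may round or count days differently.

```python
from typing import Dict

MINUTES_PER_DAY = 24 * 60

# Assumed period lengths in days (illustrative; real calendars vary).
PERIOD_DAYS = {"day": 1, "month": 30, "quarter": 90, "year": 365}


def allowed_downtime_minutes(availability_percent: float) -> Dict[str, float]:
    """Return the downtime budget in minutes for each period.

    availability_percent is a value such as 99.99 (not 0.9999).
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")

    # Fraction of time the service is allowed to be unavailable.
    unavailable_fraction = 1 - availability_percent / 100

    return {
        period: days * MINUTES_PER_DAY * unavailable_fraction
        for period, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for period, minutes in allowed_downtime_minutes(99.99).items():
        print(f"{period}: {minutes:.2f} minutes")
```

Under these assumptions, 99.99% works out to a little under an hour of allowed downtime per year.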
Azure Pipelines had an incident of delayed builds at the end of March. Find out more in this post-incident analysis.
Chad Kimes – Microsoft
Google published another book in their SRE series. This short summary gives an overview of what’s inside along with an explanation of the motivation for another book. See also: Google’s announcement
Todd Hoff — High Scalability
The pendulum is swinging back, and folks are starting to see the downsides of a plethora of microservices, including early champion Uber.
Todd Hoff — High Scalability
Outages
- Quibi
  - Quibi had issues on their launch day.
- Deliveroo
- Google Cloud Platform IAM
  - Click through for their interesting post-incident analysis.
- Cloudflare
  - Here’s their post-incident analysis that details a remote hands request gone awry.
- Chef
- Hulu
- Lots of Banks in the US
  - Banks went down around the time when customers were checking to see if their economic stimulus payments had arrived.
- Petnet (smart pet feeder)
- Snapchat
- Fastly
- DoorDash
- StackPath