An article on looking past human error when investigating air sports accidents, drawing on the writing of Don Norman. Special emphasis on slips versus mistakes:
“Slips tend to occur more frequently to skilled people than to novices”
Mara Schmid — Blue Skies Magazine
A VP of NS1 explains how his company rewrote and deployed their core service without downtime.
Shannon Weyric — NS1
This guide from Hosted Graphite has a ton of great advice and reads almost as if they’ve released their internal incident response guidelines. Bonus content: check out this exemplary post-incident followup from their status site.
Fran Garcia — Hosted Graphite
Check it out, Atlassian posted their incident management documentation publicly!
On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.
I promised to do a blog post with the takeaways, so here they are.
[…] at a certain point, it’s too expensive to keep fixing bugs because of the high-opportunity cost of building new features. You need to decide your target for stability just like you would availability, and it should not be 100%.
Kristine Pinedo — Bugsnag
Maelstrom is Facebook’s tool to assist engineers in safely moving traffic off of impaired infrastructure.
Adrian Colyer — The Morning Paper (summary)
Veeraraghavan et al. — Facebook (original paper)
Attempting to stamp out failure entirely can have the paradoxical effect of reducing resiliency to anomalous situations. Instead, we need to handle failure constructively.
Daniel Hummerdal — Safety Differently
- Postmortem: RDS Clogs & Cache-Refresh Crash Loops – Honeycomb
I guess it’s probably mean of me, but I always get excited when Honeycomb has an outage, because I love reading their followup analyses. This one expertly deconstructs a messy incident with lots of contributing factors.
Rachel Fong — Honeycomb
- GitHub had a severe outage this week. Their brief summary (linked above) brings to mind the mention of the risk of data center isolation in this article from July:
- Travis CI
- Caused by the GitHub outage.
- Also this one and a few other minor ones.
- The above is a total outage for one hour. They also had a less severe incident the previous day.