Articles
If you missed the STELLA Report, released last fall during Velocity NYC by John Allspaw, Richard Cook, and David Woods, this podcast is a great intro. And even if you did catch it, it’s well worth a listen. The Food Fight folks interview John Allspaw and there’s some really stellar (see what I did there) back-and-forth discussion.
Alan Kraft and Nathen Harvey
Great idea. This reminds me of a couple jobs back where I rigged up our infrastructure to log every command entered at the shell into a Slack channel.
Rachel Kroll
This excerpt from the Google SRE book is worth reading if only for this nifty idea for graceful degradation:
Other techniques include […] choosing a consistent subset of clients to receive errors, preserving a good user experience for the remainder.
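To make the "consistent subset" idea concrete, here is a minimal sketch, not the approach from the SRE book itself. It assumes each request carries a stable client identifier. The names `should_shed`, `handle_request`, and `shed_percent` are hypothetical, invented for this illustration. Hashing the client ID into one of 100 buckets means the same clients get errors on every request, rather than everyone seeing occasional random failures.

```python
import hashlib


def should_shed(client_id: str, shed_percent: int) -> bool:
    """Return True if this client is in the subset chosen to receive errors.

    Hypothetical helper: the client ID is hashed into a bucket from 0 to 99,
    and clients whose bucket is below shed_percent are shed. A cryptographic
    hash is used only because it is stable across processes and restarts,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < shed_percent


def handle_request(client_id: str, shed_percent: int) -> int:
    """Return an HTTP-style status code: 503 for shed clients, 200 otherwise."""
    if should_shed(client_id, shed_percent):
        return 503
    return 200
```

Because the bucket depends only on the client ID, raising `shed_percent` during an overload adds clients to the shed set without changing the outcome for those already in it.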
In part two of this story, the author causes their first incident (oops) and then significantly improves the performance of the system in question (cool!).
Evan Smith — Hosted Graphite
An introduction to blue/green deployments, including the risks they help alleviate.
Mark Henke — Rollout.io
instead of giving guidelines on how and when to do things, I am going to lay out a few ideas on how to respond to alerts and leave it up to you to decide what methods work best for your app and your organization.
Peter Christian Fraedrich — Capital One
Especially in Ubuntu, it’s harder than it used to be to get a core dump, thanks to apport and the like.
Julia Evans
NCDEX, a commodities exchange in Mumbai, India, has been operating out of its disaster recovery site for two weeks. Unfortunately, it looks like performance is not on par with the standard site.
Rajesh Bhayani — Business Standard
You may have heard that a Southwest flight suffered a catastrophic engine failure that left one passenger dead. The day after my family flew a Southwest flight to Orlando. Yikes.
The air traffic control audio recording is incredible to listen to. The pilot who was on the radio was cool and calm as she responded to the incident and arranged for landing and emergency ground crews.
Outages
- IRS (US tax system)
  - The IRS had to extend the deadline for Americans to file their taxes as a result of an overload and outage in their electronic tax filing system.
- TSB (bank)
- Heroku
  - Also this one.
- Google Cloud Pub/Sub
- Woolworth’s (grocery store chain)
- Discord
- Fortnite (game)
  - I normally don’t include games, but this outage is amusing because downtime on Fortnite apparently causes a surge in traffic to a popular adult site, threatening their availability.
- Telegram
- TSX (Toronto stock exchange)