Articles
What an excellent resource! This repo contains a pile of postmortems for our reading and learning pleasure. I’m linking to the repo now, but I don’t promise not to call out specific awesome postmortems from it in the future.
When you’re in the trenches trying to get the service back up and running, it can be hard to find the time to tell everyone else in your company what’s going on. It’s critically important though, as Statuspage.io writes in this article.
Full disclosure: Heroku, my employer, is mentioned.
DigitalOcean shares this overview of the basic concepts involved in high availability.
This article discusses a method of computing the availability of an overall system made up of individual components with differing availabilities. It gives general formulas and methods that are fairly simple, yet powerful.
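The article's own formulas aren't reproduced here, but the standard ones for this kind of calculation are short enough to sketch. Assuming component failures are independent (a big assumption in practice), components that must all be up multiply their availabilities, while redundant components fail only when every copy fails. The function names and the example numbers below are hypothetical, chosen purely for illustration.

```python
from math import prod

def series(availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)

def parallel(availabilities):
    """Redundant components: the group is down only if all of them are down."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical system: two redundant web servers (99% each)
# in front of a single database (99.9%).
web_tier = parallel([0.99, 0.99])
system = series([web_tier, 0.999])
print(f"web tier: {web_tier:.4%}, system: {system:.4%}")
```

Note how the redundant pair ends up more available than the lone database behind it, so the database dominates the overall figure.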
What do you do when you have to modify an existing production system that has less-than-wonderful code quality? This article is an impassioned plea to test the heck out of your changes and always try to release production-quality code the first time.
Google is launching a reverse proxy for DDoS mitigation. Interestingly, it’s only for news and free speech sites, and it’s completely free.
Outages
- Xbox Live
- PartyPoker
- Telenor (mobile operator)
  This one’s interesting. Invalid signaling from another operator took down Telenor.
  The unusual data sent from an international operator was misinterpreted in software from Ericsson, halting part of the mobile traffic on Telenor’s network.
- Office 365
- T-Mobile
- EE (UK mobile operator)
- Xero
- Verizon Wireless