Articles
My favorite:
Don’t wait until the post-mortem; consider doing a “pre-mortem” brainstorm to identify potential issues.
John Agger — Fastly
Full disclosure: Fastly is my employer.
Let’s Encrypt deals with pretty heavy traffic. This post goes into what it takes for them to run a successful Certificate Transparency log.
Phil Porada — Let’s Encrypt
In this air traffic radio recording from Las Vegas (2018), the air traffic controller becomes impaired and starts issuing confusing and dangerously incorrect directives. The pilots work together to correct the situation and no accidents occur. This is a classic example of a resilient system.
I don’t normally link to posts that heavily cover product offerings, but this one has some real gems. I especially like the discussion toward the end about why it is important to analyze an incident shortly after it happens.
John Allspaw — Adaptive Capacity Labs
This is a striking analogue for an infrastructure with many unactionable alerts.
The commission has estimated that of the thousands of alarms going off throughout a hospital every day, an estimated 85 to 99 percent do not require clinical intervention.
Melissa Bailey — The Washington Post
A fascinating look at the early days of Etsy, in which a system is rewritten, the rewrite blows up, the rewrite is rewritten, and finally that rewrite is itself rewritten. Ouch.
Dan McKinley (@mcfunley)
If your DR test involves carefully contrived circumstances that don’t match the real world, then it’s not a real test. Point your upper management at this article if you need to argue for true DR testing.
Ivan Pepelnjak
Outages
- GitHub
- BNZ (bank)
- Bank of Ireland
- Rakuten
- IndiGo (airline)
- Tinder
- Amino App
- Costco
- Nordstrom Rack
- Facebook and Instagram
- This one happened on the US’s Thanksgiving Day.
- Tesla App
- ABC News website
- An outage resulted in articles from 2011 being served to visitors.
- Heroku
- Squarespace
- NatWest Bank
- Thanks to Dr. Richard Cook for this one.