Articles
Whoops, here’s one that got lost in my review queue. Charity Majors (one of the usual suspects here at SRE Weekly) wrote one of her characteristically detailed and experience-filled posts on how to isolate your production, staging, and development environments in AWS.
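Her post goes well beyond this, but here’s a minimal, hypothetical sketch of one way the isolation can look in day-to-day tooling, assuming a separate AWS account per environment with named CLI profiles (the profile names are made up):

```python
import boto3

# Hypothetical profile names; each profile in ~/.aws/config points at a
# separate AWS account, so tooling aimed at "dev" simply has no
# credentials that can reach production resources.
ENV_PROFILES = {
    "dev": "example-dev",
    "staging": "example-staging",
    "prod": "example-prod",
}

def client_for(env: str, service: str):
    """Return a boto3 client scoped to the given environment's account."""
    session = boto3.Session(profile_name=ENV_PROFILES[env])
    return session.client(service)

# e.g. list S3 buckets in staging with no risk of touching prod
s3 = client_for("staging", "s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```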
I can’t quite tell how much of this is John Allspaw’s writing and how much is written by the US Forest Service, but I love it all. Here’s a bulleted list of points driving home the fact that we constantly strike a balance between risk and safety.
Server Density added multi-datacenter redundancy to their infrastructure in 2013, and they were kind enough to document what they learned. In this first of two articles, they outline different kinds of multi-datacenter setups and go over the kinds of things you’ll have to think about as you retrofit your application.
This short opinion piece raises an excellent idea: SLAs aren’t for recouping the cost you incurred due to an outage; they’re for making a point to a service provider about the outage.
Southwest has released some numbers on the impact of last month’s outage that resulted in thousands of cancelled flights.
Netflix gives us a rundown of how they prepare a title for release by pre-filling caches in their in-house CDN. I like the part about scheduling pre-fills for off-peak hours to avoid impacting the service.
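Netflix’s actual fill scheduling is surely far more sophisticated, but here’s a rough, hypothetical sketch of the off-peak idea (the window boundaries and the cache’s push API are invented for illustration):

```python
from datetime import datetime, time

# Hypothetical off-peak window, in the edge cache's local time.
OFF_PEAK_START = time(2, 0)   # 02:00
OFF_PEAK_END = time(9, 0)     # 09:00

def in_off_peak_window(now: datetime) -> bool:
    """True when member traffic is low enough to pre-fill safely."""
    return OFF_PEAK_START <= now.time() < OFF_PEAK_END

def maybe_prefill(cache, titles, now=None):
    """Push upcoming titles to an edge cache, but only during off-peak hours."""
    now = now or datetime.now()
    if not in_off_peak_window(now):
        return False  # defer; serving member traffic always wins
    for title in titles:
        cache.push(title)  # hypothetical push API
    return True
```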
How much is your company willing to invest for a truly effective DR solution? This article asks that question and along the way digs into what an effective DR solution looks like and why it costs so much.
Outages
- Syria
  The Syrian government shut down internet access to prevent cheating on school exams.
- Mailgun
  The linked postmortem is really interesting: Mailgun experienced an outage when their domain registrar abruptly placed their domain on hold. The registrar was subsequently largely uncommunicative, hampering incident resolution. Lesson learned: make sure you can trust your registrar, because they have the power to ruin your day.
- Belnet
  The linked article has some intriguing detail about a network equipment failure that caused a routing loop.
- Australia’s census website
  This caught my eye:
    Revolution IT simulated an average sustained peak of up to 350 submissions per second, but only expected up to 250 submissions per second.
  Load testing only 40% above expected peak demand? That seems like a big red flag to me.
- Etisalat (UAE ISP)
- Vodafone
- Google Drive
- AT&T
- Delta Air Lines
  A datacenter power system failure resulted in cancelled flights worldwide.