SRE Weekly Issue #35

SPONSOR MESSAGE

What is Modern Incident Management? Download the Incident Management Buyer’s Guide to learn all about it, and the value it provides. Get your copy here: http://try.victorops.com/l/44432/2016-08-04/dwp8lc

Articles

Whoops, here’s one that got lost in my review queue. Charity Majors (one of the usual suspects here at SRE Weekly) wrote one of her characteristically detailed and experience-filled posts on how to isolate your production, staging, and development environments in AWS.

I can’t quite tell how much of this is John Allspaw’s writing and how much is written by the US Forest Service, but I love it all. Here’s a bulleted list of points driving home the fact that we constantly strike a balance between risk and safety.

Server Density added multi-datacenter redundancy to their infrastructure in 2013, and they were kind enough to document what they learned. In this first of two articles, they outline different kinds of multi-datacenter setups and go over the kinds of things you’ll have to think about as you retrofit your application.

This short opinion piece raises an excellent idea: SLAs aren’t for recouping the cost you incurred due to an outage; they are for making a point to a service provider about the outage.

Southwest has released some numbers on the impact of last month’s outage that resulted in thousands of cancelled flights.

Netflix gives us a rundown of how they prepare a title for release by pre-filling caches in their in-house CDN. I like the part about timing pre-filling during off-peak hours to avoid impacting the service.

How much is your company willing to invest for a truly effective DR solution? This article asks that question and along the way digs into what an effective DR solution looks like and why it costs so much.

Outages

  • Syria
    • The Syrian government shut internet access down to prevent cheating on school exams.

  • Mailgun
    • The linked postmortem is really interesting: Mailgun experienced an outage when their domain registrar abruptly placed their domain on hold. The registrar was then largely uncommunicative, which hampered incident resolution. Lesson learned: make sure you can trust your registrar, because they have the power to ruin your day.

  • Belnet
    • The linked article has some intriguing detail about a network equipment failure that caused a routing loop.

  • Australia’s census website
    • This caught my eye:

      Revolution IT simulated an average sustained peak of up to 350 submissions per second, but only expected up to 250 submission per second.

      Load testing only 40% above expected peak demand? That seems like a big red flag to me.

  • Reddit
  • Etisalat (UAE ISP)
  • Vodafone
  • Google Drive
  • AT&T
  • Delta Air Lines
    • A datacenter power system failure resulted in cancelled flights worldwide.
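
The load-testing headroom mentioned in the census item above can be checked with a little arithmetic. This is only a minimal sketch: the function name `headroom_percent` is hypothetical, and the two figures are the ones quoted in that item.

```python
def headroom_percent(tested_peak: float, expected_peak: float) -> float:
    """Return how far the tested load exceeds the expected peak, as a percentage."""
    if expected_peak <= 0:
        raise ValueError("expected_peak must be positive")
    # Multiply before dividing so that whole-number inputs stay exact.
    return (tested_peak - expected_peak) * 100.0 / expected_peak


# Figures quoted in the census item: 350 tested vs. 250 expected submissions per second.
census_headroom = headroom_percent(350, 250)
print(f"Tested load exceeds expected peak by {census_headroom:.0f}%")
```

How much headroom is enough depends on how confident you are in the expected-peak estimate; the sketch only computes the margin, it does not judge it.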

Updated: August 14, 2016 — 10:17 pm
A production of Tinker Tinker Tinker, LLC