SRE Weekly Issue #35




Whoops, here’s one that got lost in my review queue. Charity Majors (one of the usual suspects here at SRE Weekly) wrote one of her characteristically detailed and experience-filled posts on how to isolate your production, staging, and development environments in AWS.

I can’t quite tell how much of this is John Allspaw’s writing and how much is written by the US Forest Service, but I love it all. Here’s a bulleted list of points driving home the fact that we constantly strike a balance between risk and safety.

Server Density added multi-datacenter redundancy to their infrastructure in 2013, and they were kind enough to document what they learned. In this first of two articles, they outline different kinds of multi-datacenter setups and go over the kinds of things you’ll have to think about as you retrofit your application.

This short opinion piece raises an excellent idea: SLAs aren’t for recouping the cost you incurred due to an outage; they are for making a point to a service provider about the outage.

Southwest has released some numbers on the impact of last month’s outage that resulted in thousands of cancelled flights.

Netflix gives us a rundown of how they prepare a title for release by pre-filling caches in their in-house CDN. I like the part about timing pre-filling during off-peak hours to avoid impacting the service.

How much is your company willing to invest for a truly effective DR solution? This article asks that question and along the way digs into what an effective DR solution looks like and why it costs so much.


  • Syria
    • The Syrian government shut internet access down to prevent cheating on school exams.

  • Mailgun
    • The linked postmortem is really interesting: Mailgun experienced an outage when their domain registrar abruptly placed their domain on hold. The registrar was then largely uncommunicative, which hampered incident resolution. Lesson learned: make sure you can trust your registrar, because they have the power to ruin your day.

  • Belnet
    • The linked article has some intriguing detail about a network equipment failure that caused a routing loop.

  • Australia’s census website
    • This caught my eye:

      Revolution IT simulated an average sustained peak of up to 350 submissions per second, but only expected up to 250 submission per second.

      Load testing only 40% above expected peak demand? That seems like a big red flag to me.

  • Reddit
  • Etisalat (UAE ISP)
  • Vodafone
  • Google Drive
  • AT&T
  • Delta Air Lines
    • A datacenter power system failure resulted in cancelled flights worldwide.
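
For the census load test quoted above, the headroom figure is simple arithmetic. Here is a minimal sketch that reproduces it; the function name `headroom_percent` is my own, used only for illustration, and the two rates are the ones from the quote.

```python
def headroom_percent(tested_rate: float, expected_rate: float) -> float:
    """Return how far the tested rate exceeds the expected rate, in percent."""
    if expected_rate <= 0:
        raise ValueError("expected_rate must be positive")
    # Multiply before dividing so the quoted figures give an exact result.
    return (tested_rate - expected_rate) * 100 / expected_rate


# Figures from the quote: 350 submissions/s tested, 250 submissions/s expected.
census_headroom = headroom_percent(350, 250)
print(f"Tested {census_headroom:.0f}% above expected peak")
```

The same function makes it easy to compare against whatever safety margin your own team considers comfortable for a one-off, high-profile event.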

Updated: August 14, 2016 — 10:17 pm