SRE Weekly Issue #35

Articles

Whoops, here’s one that got lost in my review queue. Charity Majors (one of the usual suspects here at SRE Weekly) wrote one of her characteristically detailed and experience-filled posts on how to isolate your production, staging, and development environments in AWS.

Paradigm Check Point: Prefacing Debriefings

I can’t quite tell how much of this is John Allspaw’s writing and how much is written by the US Forest Service, but I love it all. Here’s a bulleted list of points driving home the fact that we constantly strike a balance between risk and safety.

Multi data center redundancy – application considerations

Server Density added multi-datacenter redundancy to their infrastructure in 2013, and they were kind enough to document what they learned. In this first of two articles, they outline different kinds of multi-datacenter setups and go over the kinds of things you’ll have to think about as you retrofit your application.

Making a point with SLAs

This short opinion piece raises an excellent idea: SLAs aren’t for recouping the cost you incurred due to an outage; they are for making a point to a service provider about the outage.

Cost of Southwest’s tech outage climbs to at least $54 million

Southwest has released some numbers on the impact of last month’s outage that resulted in thousands of cancelled flights.

Netflix and Fill

Netflix gives us a rundown of how they prepare a title for release by pre-filling caches in their in-house CDN. I like the part about timing pre-filling during off-peak hours to avoid impacting the service.

Delta Datacenter Crash: Do the Math on Disaster Recovery ROI

How much is your company willing to invest for a truly effective DR solution? This article asks that question and along the way digs into what an effective DR solution looks like and why it costs so much.

Outages

Syria
- The Syrian government shut internet access down to prevent cheating on school exams.
Mailgun
- The linked postmortem is really interesting: Mailgun experienced an outage when their domain registrar abruptly placed their domain on hold. The registrar was then largely uncommunicative, which hampered incident resolution. Lesson learned: make sure you can trust your registrar, because they have the power to ruin your day.
Belnet
- The linked article has some intriguing detail about a network equipment failure that caused a routing loop.
Australia’s census website
- This caught my eye:
  
  Revolution IT simulated an average sustained peak of up to 350 submissions per second, but only expected up to 250 submissions per second.
  
  Load testing only 40% above expected peak demand? That seems like a big red flag to me.
Reddit
Etisalat (UAE ISP)
Vodafone
Google Drive
AT&T
Delta Air Lines
- A datacenter power system failure resulted in cancelled flights worldwide.
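
For anyone who wants to check the 40% figure in the census item above, here is a minimal sketch of the arithmetic. The function name `headroom_percent` and its parameters are my own, purely for illustration; only the 350 and 250 submissions-per-second numbers come from the quoted article.

```python
def headroom_percent(tested_peak: float, expected_peak: float) -> float:
    """Return how far the tested load exceeds the expected peak, as a percentage."""
    if expected_peak <= 0:
        raise ValueError("expected_peak must be positive")
    # Multiply before dividing so round numbers stay exact in floating point.
    return (tested_peak - expected_peak) * 100.0 / expected_peak


# Figures quoted in the census item (submissions per second).
census_headroom = headroom_percent(tested_peak=350, expected_peak=250)
print(f"Load test headroom: {census_headroom:.0f}%")
```

How much headroom counts as enough is a judgment call that depends on how confident you are in the demand forecast; the sketch only makes the margin explicit.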
