SRE Weekly Issue #48

This is the first issue of SRE Weekly going out to over 1000 email subscribers! Thanks all of you for continuing to make my little side project so rewarding and fulfilling. I can’t believe I’m almost at a year.

Speaking of which, there won’t be an issue next week while my family and I are vacationing at Disney World. See you in two weeks!

SPONSOR MESSAGE

Downtime sucks. Learn how leading minds in tech respond to outages on the Nov. 16th “Ask Me Anything” from Catchpoint & O’Reilly Media: http://try.victorops.com/AMA

Articles

A detailed description of Disaster Recovery as a Service (DRaaS), including a discussion of the cost versus creating a DR site oneself. This is the part I always wonder about:

However, for larger enterprises with complex infrastructures and larger data volumes spread across disparate systems, DRaaS has often been too complicated and expensive to implement.

This one’s so short I can almost quote the whole thing here. I love its succinctness:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Just over four years ago, Amazon had a major outage in Elastic Block Store (EBS). Did you see impact? I sure did. Here’s Netflix’s account of how they survived the outage mostly unscathed.

I’m glad to see more people writing that Serverless != #NoOps. This article is well-argued even though it turns into an OnPage ad 3 paragraphs from the end.

What else can we expect from Greater Than Code + Charity Majors? This podcast is 50 minutes of awesome, and there’s a transcription, too! Listen/read for awesome phrases like “stamping out chaos”, find out why Charity says, “I personally hate [the term ‘SRE’] (but I hate a lot of things)”, and hear Conway’s law applied to microservices, #NoOps debunking, and a poignant ending about misogyny and equality.

Microsoft released its Route 53 competitor in late September. They say:

Azure DNS has the scale and redundancy built-in to ensure high availability for your domains. As a global service, Azure DNS is resilient to multiple Azure region failures and network partitioning for both its control plane and DNS serving plane.

This issue of BWH Safety Matters details an incident in which a communication issue between teams that don’t normally work together resulted in a patient injury. This is exactly the kind of pitfall that becomes more prevalent with the move toward microservices, as siloed teams sometimes come into contact only during an incident.

A detailed postmortem from an outage last month. Lots of takeaways, including one that kept coming up: test your emergency tooling before you need to use it.

Outages

  • Canada’s immigration site
    • I’m sure this is indicative of something.
  • Office 365
  • Twitter
    • Twitter stopped announcing their AS in BGP worldwide, resulting in a 30-minute outage on Monday.
  • Google BigQuery
    • Google writes really great postmortems! Here’s one for a 4-hour outage in BigQuery on November 8, posted on November 11. Fast turnaround and an excellent analysis. Thanks, Google — we appreciate your hard work and transparency!
  • Pingdom
    • Normally I wouldn’t include such a minor outage, but I love the phrase “unintended human error” that they used. Much better than the intended kind.
  • WikiLeaks
  • eBay
Updated: November 13, 2016 — 9:07 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme