SRE Weekly Issue #88

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

From Caitie McCaffrey:

I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.

Julia Evans just blew my mind (once again). In this article, among other things, she links to a tool that tells you which function in the kernel dropped a packet. I’ve been wishing for such a tool for years!

I love that companies are starting to publish lessons learned from game days and other chaos experiments. Just like a post-incident followup, there’s so much we can learn by following along.

It’s an absolute must for any disaster recovery plan worth its name to include power supply as a crucial factor – because, without power, you simply can’t do business.

Here’s the last installment of Jason Hand’s digest version of his new eBook, Post-Incident Reviews.

If I leave you with one take-away from this guide, it should be that every incident provides an opportunity for your team to be more prepared for the next one.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How can you prevent a colo failure? Obviously, colo customers can’t, but we can at least prepare. This article has advice for understanding a provider’s history, policies, and procedures related to outages.

Just click through.

In this analysis of the factors leading to a plane crash, we see another example of the critical role that human/computer interfaces play in allowing (or preventing) humans to recover from a system failure.

Move over, backhoes: water is the other natural enemy of the fiber optic network.

The New York Times has a Kafka installation containing everything they’ve published in their entire history, and it powers the front page, search, suggestions, and everything else.

Outages

  • AbeBooks.com
    • AbeBooks is the place to go for out-of-print books and old editions. The site going down meant that many used booksellers lost a major sales outlet.
  • Gmail
  • Apple developer portal
  • Google Drive
  • iCloud Mail
  • Heroku
    • Heroku posted a pile of public followups this past week:
      • Incidents 1251 and 1254 – In both of these incidents, applications failed due to missing Debian packages normally provided by the Heroku platform.
      • Incident 1257 – For a few minutes, 10% of requests to Heroku applications hosted in Europe failed.
      • Incident 1270 – Applications last deployed over 3 years ago spontaneously stopped working.

      Full disclosure: Heroku is my employer.

Updated: September 10, 2017 — 10:00 pm
A production of Tinker Tinker Tinker, LLC