SRE Weekly Issue #64

SPONSOR MESSAGE

Got ChatOps? This 75-page ebook from O’Reilly Media covers ChatOps from concept to deployment. Get started managing operations in group chat today. Download your free copy here: http://try.victorops.com/sreweekly/chatops

Articles

I wasn’t able to make it to SRECon17 Americas this year, but it sounds like it was a great time. (day two summary)

My heroine, Julia Evans, gave the plenary session at SRECon17 Americas, all about how to learn to be an excellent engineer (or really anything!). She proved herself once again to be not just an excellent student, but also an inspiring teacher. The best part is that she posted the abstract, slides, and a transcript of her talk shortly after giving it! This is a really excellent resource for folks like me who weren’t there, and I hope more speakers will follow her example.

This article is long, but I wish I’d carved out time for it long ago, because it’s really incredible and well worth the read. John Allspaw uses the SEC analysis of the Knight Capital incident as a starting point to introduce and discuss the problems with Counterfactual Thinking (“if the engineer had just done ___, this wouldn’t have happened”).

Rolling back a flawed code release can have significant risk. It doesn’t always fix the problem because the erroneous code may have had effects on other parts of the system. Sometimes, as in the Knight Capital incident, a rollback exacerbates the problem.
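
To make that concrete, here’s a minimal sketch (hypothetical flag and function names, only loosely modeled on the Knight Capital writeup) of how rolling code back while configuration stays behind can resurrect an obsolete code path:

    # Hypothetical example: a feature flag that guarded dead test-only code in
    # the old release is repurposed to guard real routing logic in the new one.
    FLAGS = {"legacy_flag": True}  # left enabled in production for the new release

    def route_order_old(order):
        """Old release: the flag guards an obsolete code path."""
        if FLAGS["legacy_flag"]:
            return f"OBSOLETE path handled {order}"
        return f"normal path handled {order}"

    def route_order_new(order):
        """New release: the same flag now guards the intended new logic."""
        if FLAGS["legacy_flag"]:
            return f"new logic handled {order}"
        return f"normal path handled {order}"

    print(route_order_new("order-1"))  # forward deploy behaves as intended
    # Rolling back the code without rolling back the flag re-activates the
    # obsolete path, so the rollback makes things worse rather than better.
    print(route_order_old("order-2"))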

This is part two of an annotation of the Google SRE book by Stephen Thorne, a Google SRE. Part three is available too.

Here’s an interesting idea: using metadata about incidents as a proxy for measuring technical debt. PagerDuty goes over the definition of technical debt before diving into measuring it.
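
As a rough illustration of the idea (not PagerDuty’s actual method; the data shape and field names below are assumptions), incident metadata you already have, such as which service paged and how long response took, can be rolled up into a simple debt-style metric:

    from collections import Counter
    from datetime import datetime

    # Assumed incident metadata shape; real data would come from your
    # incident-management tool's export or API.
    incidents = [
        {"service": "payments", "created": "2017-03-01T02:10:00Z", "resolved": "2017-03-01T03:00:00Z"},
        {"service": "payments", "created": "2017-03-08T14:00:00Z", "resolved": "2017-03-08T14:20:00Z"},
        {"service": "search",   "created": "2017-03-12T09:30:00Z", "resolved": "2017-03-12T09:45:00Z"},
    ]

    def ts(value):
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")

    counts = Counter(i["service"] for i in incidents)  # repeat incidents per service
    hours = Counter()                                   # hours spent responding per service
    for i in incidents:
        hours[i["service"]] += (ts(i["resolved"]) - ts(i["created"])).total_seconds() / 3600

    # Services that page often and eat response time are candidates for carrying
    # the most technical debt.
    for service, n in counts.most_common():
        print(f"{service}: {n} incidents, {hours[service]:.1f} response-hours")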

GitLab posted an update on “team-member-1”, the engineer who entered the commands that caused their production database to be erased. I love that they posted this, because I for one was worried about “team-member-1” as a second victim.

During an incident, emotions can run strong. How can we set up incident response so as to provide the best environment for our responders?

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

  • AWS Route 53
    • Route 53 had a control plane outage, though actual query responses were unaffected.
  • Square
    • Square suffered a 2-hour outage, and if this postmortem is any indication, they learned a lot from it. This bit is interesting in light of the article above about rollbacks:

      We rolled back all software changes that happened leading up to the incident. This is a non-negotiable response to any customer-impacting event; our engineers are trained to undo any change that happened before an incident regardless of how plausible it is that the change caused the issue.

  • StatusPage.io
    • This happened during Square’s outage and impacted their ability to communicate.
  • CBS
    • CBS’s site was down, so people couldn’t fill out their fantasy sportsball brackets 1 hour before the game started.