SRE Weekly Issue #64

SPONSOR MESSAGE

Got ChatOps? This 75-page ebook from O’Reilly Media covers ChatOps from concept to deployment. Get started managing operations in group chat today. Download your free copy here: http://try.victorops.com/sreweekly/chatops

Articles

I wasn’t able to make it to SRECon17 Americas this year, but it sounds like it was a great time. (day two summary)

My heroine, Julia Evans, gave the plenary session at SRECon17 Americas, all about how to learn to be an excellent engineer (or really anything!). She proved herself once again to be not just an excellent student, but also an inspiring teacher. The best part is that she posted the abstract, slides, and a transcript of her talk shortly after giving it! This is a really excellent resource for folks like me who weren’t there, and I hope more speakers will follow her example.

This article is long, but I wish I’d carved out time for it long ago, because it’s really incredible and well worth the read. John Allspaw uses the SEC analysis of the Knight Capital incident as a starting point to introduce and discuss the problems with Counterfactual Thinking (“if the engineer had just done ___, this wouldn’t have happened”).

Rolling back a flawed code release can have significant risk. It doesn’t always fix the problem because the erroneous code may have had effects on other parts of the system. Sometimes, as in the Knight Capital incident, a rollback exacerbates the problem.
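
To make that concrete, here’s a minimal sketch (hypothetical flag and function names, only loosely modeled on the Knight Capital writeup) of how rolling code back while configuration stays behind can resurrect an obsolete code path:

    # Hypothetical example: a feature flag that guarded dead test-only code in
    # the old release is repurposed to guard real routing logic in the new one.
    FLAGS = {"legacy_flag": True}  # left enabled in production for the new release

    def route_order_old(order):
        """Old release: the flag guards an obsolete code path."""
        if FLAGS["legacy_flag"]:
            return f"OBSOLETE path handled {order}"
        return f"normal path handled {order}"

    def route_order_new(order):
        """New release: the same flag now guards the intended new logic."""
        if FLAGS["legacy_flag"]:
            return f"new logic handled {order}"
        return f"normal path handled {order}"

    print(route_order_new("order-1"))  # forward deploy behaves as intended
    # Rolling back the code without rolling back the flag re-activates the
    # obsolete path, so the rollback makes things worse rather than better.
    print(route_order_old("order-2"))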

This is part two of an annotation of the Google SRE book by Stephen Thorne, a Google SRE. Part three is available too.

Here’s an interesting idea: using metadata about incidents as a proxy for measuring technical debt. PagerDuty goes over the definition of technical debt before diving into measuring it.
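
As a rough illustration of the idea (not PagerDuty’s actual method; the data shape and field names below are assumptions), incident metadata you already have, such as which service paged and how long response took, can be rolled up into a simple debt-style metric:

    from collections import Counter
    from datetime import datetime

    # Assumed incident metadata shape; real data would come from your
    # incident-management tool's export or API.
    incidents = [
        {"service": "payments", "created": "2017-03-01T02:10:00Z", "resolved": "2017-03-01T03:00:00Z"},
        {"service": "payments", "created": "2017-03-08T14:00:00Z", "resolved": "2017-03-08T14:20:00Z"},
        {"service": "search",   "created": "2017-03-12T09:30:00Z", "resolved": "2017-03-12T09:45:00Z"},
    ]

    def ts(value):
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")

    counts = Counter(i["service"] for i in incidents)  # repeat incidents per service
    hours = Counter()                                   # hours spent responding per service
    for i in incidents:
        hours[i["service"]] += (ts(i["resolved"]) - ts(i["created"])).total_seconds() / 3600

    # Services that page often and eat response time are candidates for carrying
    # the most technical debt.
    for service, n in counts.most_common():
        print(f"{service}: {n} incidents, {hours[service]:.1f} response-hours")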

GitLab posted an update on “team-member-1”, the engineer who entered the commands that caused their production database to be erased. I love that they posted this, because I for one was worried about “team-member-1” as a second victim.

During an incident, emotions can run strong. How can we set up incident response so as to provide the best environment for our responders?

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

  • AWS Route 53
    • Route 53 had a control plane outage, though actual query responses were unaffected.
  • Square
    • Square suffered a 2-hour outage, and if this postmortem is any indication, they learned a lot from it. This bit is interesting in light of the article above about rollbacks:

      We rolled back all software changes that happened leading up to the incident. This is a non-negotiable response to any customer-impacting event; our engineers are trained to undo any change that happened before an incident regardless of how plausible it is that the change caused the issue.

  • StatusPage.io
    • This happened during Square’s outage and impacted their ability to communicate.
  • CBS
    • CBS’s site was down, so people couldn’t fill out their fantasy sportsball brackets 1 hour before the game started.