SRE Weekly Issue #98


Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after-hours fun. See you in Vegas!


I’ve mentioned Blackrock3 Partners here before, a team of veteran firefighters who train IT incident responders in the same Incident Management System used by firefighters and other disaster responders. Until now, they’ve only done training and consulting directly with companies.

Now, for the first time, they are opening a training session to the public, so you can learn to be an Incident Commander (IC) without having to be at a company that contracts with them. Their training will significantly up your incident response game, so don’t miss out on this. Click through for information on tickets.

Blackrock3 Partners has not provided me with compensation in any form for including this link.

This is John Allspaw’s 30-minute talk at DOES17, and it contains so much awesomeness that I really hope you’ll make time for it. Here are a couple of teasers (paraphrased):

Treat incidents as unplanned investments in your infrastructure.

Perform retrospectives not to brainstorm remediation items but to understand where your mental model of the system went wrong.

Here’s some more detail on Slack’s major outage on Halloween, in the form of a summary of an interview with their director of infrastructure, Julia Grace.

Google claims a lot with Cloud Spanner. Does it deliver? I’d really like to see a balanced, deeply technical review, so if you know of one, please drop me a link.

With this release, we’ve extended Cloud Spanner’s transactions and synchronous replication across regions and continents. That means no matter where your users may be, apps backed by Cloud Spanner can read and write up-to-date (strongly consistent) data globally and do so with minimal latency for end users.

Ever been on call for work and your baby? I think a fair number of us can relate. Thankfully, it sounds like these folks realized that it’s not exactly a best practice to have a parent of a 5-day-old preemie be on call…

Here’s a nice pair of articles on fault tolerance and availability. In the first post (linked above), the author defines the terms “fault”, “error”, and “failure”. The second post starts with definitions of “availability” and “stability” and covers ways of achieving them.

John Allspaw, former CTO of Etsy and author of a ton of awesome articles I’ve featured here, is moving on to something new.

Along with Dr. Richard Cook and Dr. David Woods, I’m launching a company we’re calling Adaptive Capacity Labs, and it’s focused on helping companies (via consulting and training) build their own capacity to go beyond the typical “template-driven” postmortem process and treat post-incident review as the powerful insight lens it can be.

I’m really hoping to have an opportunity to try out their training, because I know it’s going to be awesome.


  • Heroku
    • Heroku suffered an outage caused by Daylight Saving Time, according to this incident report. Happens to someone every year. Full disclosure: Heroku is my employer.
  • Google Docs
  • Discord
Updated: November 19, 2017 — 9:10 pm
A production of Tinker Tinker Tinker, LLC