SRE Weekly Issue #113


Grafana and VictorOps help teams visualize time series metrics across incident management. Here’s what you need to know:


The best kind of engineer is one that understands not only their specialty, but at least something about the fields adjacent to theirs. The empathy this confers allows one to work incredibly effectively across the company. For SREs, this is even more important.

[…] many of us are finding that the most valuable skill sets sit at the intersection of two or more disciplines.

Charity Majors — Honeycomb

GitLab held a session about recognizing and preventing burnout at their recent employee summit. They share the best tips in this article, and true to their radically open culture, they also added what they learned to their employee handbook, which is publicly available.

Clement Ho — GitLab

Here’s a post-analysis for a Travis CI incident early last year. Despite a couple of easy targets that could have been labelled as “root cause”, they instead skillfully laid out all of the contributing factors and left it at that.

Travis CI

What indeed? The same thing, just organized differently. There’s a lot of great analysis here about how ops roles can adapt to a serverless infrastructure, and how teams can best make use of ops folks.

Tom McLaughlin — ServerlessOps

Charity Majors wants you to look forward to on-call. This superb write-up of her recent conference talk explains why folks should think of on-call as an enjoyable privilege and how to shape your on-call to get there.

Jennifer Riggins

The Canary Analysis Service is Google’s internal tool that automatically analyzes canary runs and decides whether performance has been negatively impacted. My favorite section is the Lessons Learned.

Štěpán Davidovič with Betsy Beyer — ACM Queue


  • Snapchat
  • 123 Reg (hosting provider)
    • Customers lost files added since 123 Reg’s last valid backup from August, 2017.
  • partypoker
  • eBay
  • Signal and Telegram (messenger apps)
  • Alexa
    • I missed this one last week — it was apparently due to the AWS outage I reported on.
  • TD Bank
  • Oculus Rift
    • A code-signing certificate expired, rendering some existing VR headsets non-functional. Oculus is issuing a $15 store credit to affected customers.

      Because of the particulars of what expired and how it happened, the company wasn’t able to simply push an update out to users because the expired certificate was blocking Oculus’ standard software update system.

Updated: March 11, 2018 — 8:30 pm
SRE WEEKLY © 2015 Frontier Theme