SRE Weekly Issue #78

SPONSOR MESSAGE

New eBook for DevOps pros: The Dev and Ops Guide to Incident Management offers 25+ pages of essential insight into building teams and improving your response to downtime.
http://try.victorops.com/SREWeekly/IM_eBook

Articles

This Master’s thesis by Crista Vesel seeks to answer the question, “How does the language used in the U.S. Forest Service’s Serious Accident Investigation Guide bias accident investigation analysis?” It’s an awe-inspiring analysis, drawing on Dekker, Woods, Cook, and other authors I’ve linked here repeatedly.

The most exciting part for me was the confirmation of some vague thoughts I’ve had around the use of passive versus active voice in retrospectives. By using passive voice, we can seek to reduce the kind of blaming that is inherent in active/agentive language.

It’s by Julia Evans. Just read it.

Being responsible for my programs’ operations makes me a better developer

PagerDuty again draws on ITIL, this time to outline an example system for classifying incident impact and urgency in order to determine priority.
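
For readers who haven't seen an impact/urgency scheme before, here is a minimal sketch of the general idea. It is not PagerDuty's actual classification: the three-level scales, the names `Impact`, `Urgency`, and `priority`, and the "add the ranks" rule are all assumptions chosen for illustration, loosely following the common ITIL-style matrix.

```python
from enum import IntEnum


class Impact(IntEnum):
    """How much of the business or user base is affected (assumed scale)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


class Urgency(IntEnum):
    """How quickly the incident needs a fix (assumed scale)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Impact, urgency: Urgency) -> int:
    """Combine impact and urgency into a priority from 1 (highest) to 5 (lowest).

    This is one illustrative rule: sum the two ranks and shift so that
    HIGH/HIGH maps to 1 and LOW/LOW maps to 5.
    """
    return int(impact) + int(urgency) - 1


if __name__ == "__main__":
    # Print the full matrix, one row per impact level.
    for impact in Impact:
        row = [f"P{priority(impact, urgency)}" for urgency in Urgency]
        print(f"{impact.name:<6} {' '.join(row)}")
```

A real scheme would likely define each level in terms of concrete criteria (customers affected, revenue at risk, available workarounds) rather than bare ranks.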

PagerDuty’s take on automating chaos includes a chat-bot that lets folks trigger one-off host failures, in addition to the usual periodic runs, of course.

Unfortunately, ChaosCat is significantly tied into our internal infrastructure tooling. For the moment this means we won’t be open-sourcing it.

This article is an overview of Microsoft’s DRaaS offering, Azure Site Recovery. Protip: you can just scroll past the signup-gate if you don’t feel like entering your email address.

Grab evaluated a couple of existing solutions but went with a simple custom sharding layer as a method to scale out their Redis usage.
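
As a rough illustration of what a "simple custom sharding layer" can look like, here is a minimal sketch. It is not Grab's implementation: the `ShardedStore` class, the CRC32-modulo routing rule, and the use of plain dicts as stand-ins for Redis connections are all assumptions made to keep the example self-contained.

```python
import zlib


class ShardedStore:
    """Route each key to one of several backends by hashing the key.

    Backends here are plain dicts standing in for Redis connections
    (an assumption for illustration only).
    """

    def __init__(self, shards):
        if not shards:
            raise ValueError("at least one shard is required")
        self._shards = list(shards)

    def shard_index(self, key: str) -> int:
        # CRC32 is stable across processes, unlike Python's built-in hash()
        # for strings, so every client picks the same shard for a given key.
        return zlib.crc32(key.encode("utf-8")) % len(self._shards)

    def set(self, key: str, value) -> None:
        self._shards[self.shard_index(key)][key] = value

    def get(self, key: str):
        return self._shards[self.shard_index(key)].get(key)


if __name__ == "__main__":
    store = ShardedStore([{}, {}, {}])
    store.set("driver:42", "online")
    print(store.get("driver:42"))
```

One known trade-off of plain modulo routing is that changing the number of shards remaps most keys; schemes such as consistent hashing exist to reduce that churn.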

Outages

  • Rollbar
  • LinkedIn
  • Skype
    • Suspected DDoS.
  • ATO (Australian Tax Office)
  • Dyn
    • Dyn suffered a long outage, and they posted an amazing 28 detailed updates to their status site before all was said and done. That’s something to aspire to.
  • Heroku
    • Heroku posted a followup for their series of incidents early this month. Sorry for not posting those outages when they happened! Full disclosure: Heroku is my employer.
Updated: June 25, 2017 — 9:31 pm
A production of Tinker Tinker Tinker, LLC