SRE Weekly Issue #41

SPONSOR MESSAGE

[WEBINAR] The Do’s and Don’ts of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Trestus is a new tool to generate a status page from a Trello board. Neat idea!

Your card can include markdown like any other Trello card and that will be converted to HTML on the generated status page, and any comments to the card will show up as updates to the status (and yes, markdown works in these too).

An excellent intro to writing post-incident analysis documents is the subject of this issue of Production Ready by Mathias Lafeldt. I can’t wait for the sequel in which he’ll address root causes.

Adrian Colyer of The Morning Paper gave a talk at Operability.IO with a round-up of his favorite write-ups of operations-related papers. I really love the fascinating trend of “I have no idea what I’m doing” — tools that help us infer interconnections, causality, and root causes in our increasingly complex infrastructures. Rather than try (and in my experience, usually fail) to document our massively complicated infrastructures in the face of increasing employee turnover rates, let’s just accept that this is impossible and write tools to help us understand our systems.

And for fun, a couple of amusing tweets I came across this week:

Me: oh sorry, I got paged
Date: are you a doctor?
Me: uh
Nagios: holy SHIT this cert expires in SIXTY DAYS
Me: …yes

— Alice Goldfuss (@alicegoldfuss) (check out her awesome talk at SRECon16 about the Incident Command System)

We just accidentally nuked all our auto-scaling stuff and everything shutdown. We’re evidently #serverless now.

— Honest Status Page (@honest_update)

@mipsytipsy @ceejbot imagine you didn’t know anything about dentistry and decided we don’t need to brush our teeth any more. That’s NoOps.

— Senior Oops Engineer (@ReinH)

Netflix documents the new version of their frontend gateway system, Zuul 2. They moved from blocking IO to async, which allows them to handle persistent connections from clients and better withstand retry storms and other spikes.

The advantages of async systems sound glorious, but the above benefits come at a cost to operations. […] It is difficult to follow a request as events and callbacks are processed, and the tools to help with debugging this are sorely lacking in this area.

In last week’s issue, I linked to a chapter from Susan Fowler’s upcoming book on microservices. Here’s an article summarizing her recent talk at Velocity about the same subject: how to make microservices operable. She should know: Uber runs over 1300 microservices. Also summarized is her fellow SRE Tom Croucher’s keynote talk about outages at Uber.

In this first of a series, GitHub lays out the design of their new load balancing solution. It’s pretty interesting due to a key constraint: git clones of huge repositories can’t resume if the connection is dropped, so they need to avoid losing connections whenever possible.

I’m embarrassed to say that I haven’t yet found the time to take my copy of the SRE book from its resting place on my shelf, but here’s another review with a good amount of detail on the highlights of the book.

Live migration of VMs while maintaining TCP connections makes sense — the guest’s kernel holds all the connection state. But how about live migrating containers? The answer is a Linux feature called TCP connection repair.

The SSP story (linked here two issues ago) is getting even more interesting. They apparently decided not to switch to their secondary datacenter in order to avoid losing up to fifteen minutes’ worth of data, instead taking a week+ outage.

While, in SRE, we generally don’t have to worry about our deploys literally blowing up in our faces and killing us, I find it valuable to look to other fields to learn from how they manage risk. Here’s an article about a tragic accident at UCLA in which a chemistry graduate student was severely injured and later died. A PhD chemist I know mentioned to me that the culture of safety in academia is much less rigorous than in industry, perhaps due in part to a differing regulatory environment.

Outages

SRE Weekly Issue #40

SPONSOR MESSAGE

Take a bite out of all things DevOps with video series, DevChops. Get easy to digest explanations of most-used DevOps terms and concepts in 90 seconds or less. Watch now: http://try.victorops.com/l/44432/2016-09-16/f7gpzp

Articles

Adrian Colyer summarizes James Hamilton’s 2007 paper in this edition of The Morning Paper. There’s a lot of excellent advice here — some I knew explicitly, some I mostly implement without thinking about it, and some I’d never thought about. The paper is great, but even if you don’t have time to read it, Colyer’s digest version is well worth a browse.

Susan Fowler (featured here a couple weeks ago) has a philosophy of failure in her life that I find really appealing as an SRE:

We can learn something about how to become the best versions of ourselves from how we engineer the best complex systems in the world of software engineering.

And while we’re on the subject of Susan Fowler, she’s got a book coming soon about writing reliable microservices. In the linked ebook-version of the second chapter, she goes over the requirements for a production-ready microservice: stability, reliability, scalability, fault-tolerance, catastrophe-preparedness, performance, monitoring, and documentation.

Pinterest explains how they broke their datastore up into 4096(!) shards on 4 pairs of MySQL servers (later 8192 on 8 pairs). It’s an interesting approach, although in essence it treats MySQL as a glorified key-value store for JSON documents.
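To make the shard arithmetic concrete, here is a minimal sketch of how a shard ID could be mapped to a server pair. It assumes shards are assigned to pairs in equal contiguous ranges, which is my assumption for illustration and not necessarily how Pinterest lays them out; the function name `pair_for_shard` is hypothetical.

```python
def pair_for_shard(shard_id: int, total_shards: int = 4096, server_pairs: int = 4) -> int:
    """Return the index of the server pair that holds a given shard.

    Assumes (for illustration only) that shards are spread across pairs
    in equal contiguous ranges, e.g. 4096 shards / 4 pairs = 1024 per pair.
    """
    if not 0 <= shard_id < total_shards:
        raise ValueError(f"shard_id {shard_id} out of range 0..{total_shards - 1}")
    if total_shards % server_pairs != 0:
        raise ValueError("total_shards must divide evenly across server_pairs")
    shards_per_pair = total_shards // server_pairs
    return shard_id // shards_per_pair


# Original layout: 4096 shards on 4 pairs.
print(pair_for_shard(1024))
# Later layout: 8192 shards on 8 pairs.
print(pair_for_shard(8191, total_shards=8192, server_pairs=8))
```

With many more shards than servers, each pair carries 1024 shards in either layout, so shards can in principle be moved between machines as units without re-keying the data.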

Do you use Kerberos or similar to authenticate your SSH users? What happens if there’s an incident that’s bad enough to take down your auth infrastructure? I hadn’t realized that OpenSSH supports CAs, but Facebook shows us that PKI support is easy and feature-rich.

Another project from Facebook: a load balancer for DHCP. Facebook found that anycast was not distributing requests evenly across DHCP servers, so they wrote a load balancer in Go.

In incident post-analysis, a fundamental attribution error is a tendency to see flaws in others as a cause if they were involved in an incident, but to blame the system if we were the one involved. This 4-minute segment from the Pre-Accident Podcast explains fundamental attribution error in more detail.

411 is Etsy’s new tool that runs scheduled queries against Elasticsearch and alerts on the result.

Outages

  • ING Bank
    • Here’s a terribly interesting root cause: during a test, the fire response system emitted an incredibly loud sound while dumping an inert gas into the datacenter — probably loud enough to cause hearing damage. This caused failure in multiple key spinning hard drives. Remember shouting at hard drives?
  • Heroku Status
    • Heroku released a followup with details on last week’s outage.

      Full disclosure: Heroku is my employer.

  • Gmail for Work
  • Microsoft Azure
    • Major outage involving most DNS queries for Azure resources failing. Microsoft posted a report including a root cause analysis.

SRE Weekly Issue #39

SPONSOR MESSAGE

Got ChatOps? Download the free eBook from O’Reilly Media and VictorOps: http://try.victorops.com/devopsweekly/chatops

Want even more? Meet the author on Sept 8th in a live stream event: http://try.victorops.com/devopsweekly/chatops/livestream

Articles

A+ article! Susan Fowler has been a developer, an ops person, and now an SRE. That means she’s well-qualified to give an opinion on who should be on call, and she says that the answer is developers (in most cases). Bonus content includes “What does SRE become if developers are on call?”

[…]if you are going to be woken up in the middle of the night because a bug you introduced into code caused an outage, you’re going to try your hardest to write the best code you possibly can, and catch every possible bug before it causes an outage.

Thanks to Devops Weekly for this one.

I figured this new zine from Julia Evans would be mostly review for me. WRONG. I’d never heard of dstat, opensnoop, execsnoop, or perf before, but I sure will be using them now. As far as I can tell, Julia wants to learn literally everything, and better yet, she wants to teach us what she learned and how she learned it. Hats off to her.

“While we’ve got the entire system down to do X, shall we do Y also?”

This article argues that we should never do Y. If something goes wrong, we won’t know whether to roll back X or Y, and it’ll take twice as long to figure out which one is to blame.

This week, Mathias introduces “system blindness”, the flawed understanding of how a system works and the lack of knowledge of how incomplete our understanding of it is. Whether we realize it or not, we struggle to mentally model the intricate interconnections in the increasingly complex systems we’re building.

There are no side effects, just effects that result from our flawed understanding of the system.

I’ve mentioned Spokes (formerly DGit) here previously. This time, GitHub shares the details on how they designed Spokes for high durability and availability.

TIL: Ruby can suffer from Java-style stop-the-world garbage collection freezes.

Here’s a recap of a talk about Facebook’s “Project Storm”, given by VP Jay Parikh at @Scale. Project Storm involved retrofitting Facebook’s infrastructure to handle the failure of entire datacenters.

“I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’” if it works.

Here’s an interview with Jason Hand of VictorOps about the importance of a blameless culture. He mentions the idea that “Why?” is an inherently blameful kind of question (hat tip to John Allspaw’s Infinite “How?”s). I have to say that I’m not sure I agree with Jason’s other point that we shouldn’t bother attempting incident prevention, though. Just look at the work the aviation industry has done toward accident prevention.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

SCALE has opened their CFP, and one of the chairs told me that they’d “love to get SRE focused sessions on open-source.”

Outages

  • British Airways
  • FLOW (Jamaica telecom)
  • SSP
    • SSP provides a SaaS for insurance companies to run their business on. They’re dealing with a ten-plus-day outage initially caused by some kind of power issue that fried their SAN. As a result, they’re going to decommission the datacenter in question.
  • Heroku
    • Full disclosure: Heroku is my employer.
  • Azure
    • Two EU regions went down simultaneously.
  • Overwatch (game)
  • Asana
    • Linked is a postmortem with an interesting set of root causes. A release went out that increased CPU usage, but it didn’t cause issues until peak traffic the next day. Asana is brave for enabling comments on their postmortem; I’m not sure I’d have the stomach for that. Thanks to an anonymous contributor for this one.
  • ESPN’s fantasy football
    • Unfortunate timing, being down on opening day.

SRE Weekly Issue #38

Welcome to the many new subscribers that joined in the past week. I’m not sure who I have to thank for the sudden surge, but whoever you are, thanks!

SPONSOR MESSAGE

Got ChatOps? Download the free eBook from O’Reilly Media and VictorOps: http://try.victorops.com/devopsweekly/chatops

Want even more? Meet the author on Sept 8th in a live stream event: http://try.victorops.com/devopsweekly/chatops/livestream

Articles

What can the fire service learn about safety from the aviation industry? A 29-year veteran in the fire service answers that question in detail. We could in turn apply all of those lessons to operating complex infrastructures.

I’m surprised that I haven’t come across the term “High Reliability Organization” before reading the previous article. Here’s Wikipedia’s article on HROs.

A high reliability organization (HRO) is an organization that has succeeded in avoiding catastrophes in an environment where normal accidents can be expected due to risk factors and complexity.

Etsy instruments their deployment system to strike a vertical line on their Graphite graphs for every deploy. This helps them quickly figure out whether a deploy happened right before a key metric took a turn for the worse.
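As a rough illustration of the idea, a deploy script can emit one data point per deploy using Graphite’s plaintext protocol (`metric value timestamp`), and a graph can then overlay that series with Graphite’s `drawAsInfinite()` function to turn each point into a vertical line. This is a sketch under my own assumptions, not Etsy’s actual code: the metric name `deploys.web.production` and the helper `deploy_marker_line` are hypothetical.

```python
import time
from typing import Optional


def deploy_marker_line(metric: str, timestamp: Optional[int] = None) -> str:
    """Build one Graphite plaintext-protocol line marking a deploy.

    The plaintext format is "<metric path> <value> <unix timestamp>\n".
    The value 1 simply records that a deploy happened at that second.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{metric} 1 {ts}\n"


# Hypothetical metric name; a real setup would pick its own naming scheme.
line = deploy_marker_line("deploys.web.production", timestamp=1472688000)
print(repr(line))
```

The deploy tool would write that line to the Carbon listener (TCP port 2003 by default), and a dashboard target such as `drawAsInfinite(deploys.web.production)` would draw the marker on top of the metrics you care about.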

A really interesting dive into the world of subsea network cables and the impact that cuts can have on businesses worldwide.

How closely can you really mimic production in your testing environments? In a way we’re all testing in production, and this article talks about getting that fact out in the open.

I wrote this article on my terrible little blog back in 2008 (forgive the horrid theme and apparently broken Unicode support). This was well before I worked in Linden Lab’s Ops team, back when I was making a living as a user selling content in Second Life. What’s interesting to me in reading this article 8 years later is the user perspective on the impact of the string of bad outages, and especially Linden’s poor communication during outages.

More on the impact of Delta Air Lines’ major outage last month.

Most often a catastrophic failure is not due to a lack of standards, but a breakdown or circumvention of established procedures that compounded into a disastrous outcome. Multilayer complex systems outages signify management failure to drive change and improvement.

Outages

SRE Weekly Issue #37

SPONSOR MESSAGE

Frustrated by the lack of tools available to automate incident response? Learn how ChatOps can help manage your operations through group chat in the latest book from O’Reilly. Get your copy here: http://try.victorops.com/l/44432/2016-08-19/f2xt33

Articles

Sometimes I follow chains of references from article to article until I find a new author to follow, and this time it’s Kelly Sommers. In this gem, she debunks the rarity of network partitions by recasting them as availability partitions. If half of your nodes aren’t responding because their CPUs are pegged, you still have a network partition.

most partitions I’ve experienced have nothing to do with network infrastructure failures

Two engineers from MMO company DIGIT gave this short, nicely detailed interview in which they outline how they achieve HA on AWS.

Here’s a recording of the DevOps/SRE AMA from a couple weeks back, in case you missed it.

A blog post by Skyline, who is launching their new deployment-as-a-service offering. The intro is pretty great, outlining the inherent risks in changing code and releasing new code into production.

Other online schema-change tools I’m familiar with (e.g. pt-online-schema-change) use triggers to keep a new table in sync with changes while copying old rows over. Instead, gh-ost monitors changes by hooking on as a replication slave. Very clever! This article goes into several reasons why this is a much better approach.
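To show why a change stream can stand in for triggers, here is a toy model in plain Python, with dictionaries in place of tables. It is a sketch of the general technique, not gh-ost’s code: `copy_chunk` and `apply_event` are hypothetical names, and the event tuples are a stand-in I made up for replicated row changes.

```python
def copy_chunk(source: dict, ghost: dict, keys: list) -> None:
    """Copy existing rows into the ghost table, one chunk of keys at a time.

    A copied row never overwrites a row the change stream already wrote,
    because the change stream always carries the newer version.
    """
    for key in keys:
        if key in source:
            ghost.setdefault(key, source[key])


def apply_event(ghost: dict, event: tuple) -> None:
    """Apply one change event, given as (operation, key, row), to the ghost table."""
    operation, key, row = event
    if operation == "delete":
        ghost.pop(key, None)
    else:  # "insert" or "update": the event's row always wins
        ghost[key] = row


source = {1: "a", 2: "b", 3: "c"}
ghost = {}

copy_chunk(source, ghost, [1, 2])           # first chunk of old rows
source[1] = "A"                             # a live write hits the original table...
apply_event(ghost, ("update", 1, "A"))      # ...and arrives via the change stream
del source[3]
apply_event(ghost, ("delete", 3, None))
copy_chunk(source, ghost, [3])              # row 3 is already gone, nothing to copy

print(ghost == source)
```

The point of the toy is that the original table does no extra work on each write: changes are read from the stream after the fact, so the copy can be throttled or paused without touching the write path.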

Outages
