SRE Weekly Issue #41

SPONSOR MESSAGE

[WEBINAR] The Do’s and Don’ts of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Trestus is a new tool to generate a status page from a Trello board. Neat idea!

Your card can include markdown like any other Trello card and that will be converted to HTML on the generated status page, and any comments to the card will show up as updates to the status (and yes, markdown works in these too).

An excellent intro to writing post-incident analysis documents is the subject of this issue of Production Ready by Mathias Lafeldt. I can’t wait for the sequel in which he’ll address root causes.

Adrian Colyer of The Morning Paper gave a talk at Operability.IO with a round-up of his favorite write-ups of operations-related papers. I really love the fascinating trend of “I have no idea what I’m doing” — tools that help us infer interconnections, causality, and root causes in our increasingly complex infrastructures. Rather than try (and in my experience, usually fail) to document our massively complicated infrastructures in the face of increasing employee turnover rates, let’s just accept that this is impossible and write tools to help us understand our systems.

And for fun, a couple of amusing tweets I came across this week:

Me: oh sorry, I got paged
Date: are you a doctor?
Me: uh
Nagios: holy SHIT this cert expires in SIXTY DAYS
Me: …yes

— Alice Goldfuss (@alicegoldfuss) (check out her awesome talk at SRECon16 about the Incident Command System)

We just accidentally nuked all our auto-scaling stuff and everything shutdown. We’re evidently #serverless now.

— Honest Status Page (@honest_update)

@mipsytipsy @ceejbot imagine you didn’t know anything about dentistry and decided we don’t need to brush our teeth any more. That’s NoOps.

— Senior Oops Engineer (@ReinH)

Netflix documents the new version of their frontend gateway system, Zuul 2. They moved from blocking IO to async, which allows them to handle persistent connections from clients and better withstand retry storms and other spikes.

The advantages of async systems sound glorious, but the above benefits come at a cost to operations. […] It is difficult to follow a request as events and callbacks are processed, and the tools to help with debugging this are sorely lacking in this area.
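To make the blocking-versus-async distinction concrete, here is a minimal sketch in Python's asyncio. It is not Zuul's code (Zuul is a JVM project), and the names `handle_connection`, `serve`, and `run_demo` are invented for illustration. The point is only that many idle "connections" can be parked on a single thread, because waiting yields to the event loop instead of tying up a thread per connection.

```python
import asyncio
import threading


async def handle_connection(conn_id: int, idle_seconds: float):
    # Awaiting hands control back to the event loop, so an idle
    # connection costs a small task object rather than a whole thread.
    await asyncio.sleep(idle_seconds)
    return conn_id, threading.get_ident()


async def serve(num_connections: int, idle_seconds: float = 0.05):
    # Start every simulated connection at once and wait for all of them.
    tasks = [handle_connection(i, idle_seconds) for i in range(num_connections)]
    return await asyncio.gather(*tasks)


def run_demo(num_connections: int = 1000):
    """Return (connections handled, distinct threads used)."""
    results = asyncio.run(serve(num_connections))
    thread_ids = {tid for _, tid in results}
    return len(results), len(thread_ids)


if __name__ == "__main__":
    handled, threads = run_demo()
    print(f"handled {handled} connections on {threads} thread(s)")
```

The flip side is the one the quote describes: a stack trace taken inside `handle_connection` shows the event loop, not the chain of calls that led to the request, so following a single request takes extra tooling.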

In last week’s issue, I linked to a chapter from Susan Fowler’s upcoming book on microservices. Here’s an article summarizing her recent talk at Velocity about the same subject: how to make microservices operable. She should know: Uber runs over 1300 microservices. Also summarized is her fellow SRE Tom Croucher’s keynote talk about outages at Uber.

In this first of a series, GitHub lays out the design of their new load balancing solution. It’s pretty interesting due to a key constraint: git clones of huge repositories can’t resume if the connection is dropped, so they need to avoid losing connections whenever possible.

I’m embarrassed to say that I haven’t yet found the time to take my copy of the SRE book from its resting place on my shelf, but here’s another review with a good amount of detail on the highlights of the book.

Live migration of VMs while maintaining TCP connections makes sense — the guest’s kernel holds all the connection state. But how about live migrating containers? The answer is a Linux feature called TCP connection repair.
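For the curious, here is a rough sketch of the first step only: switching a socket into repair mode. The assumptions are that the option number 19 is `TCP_REPAIR` as defined in `<linux/tcp.h>`, that the process needs `CAP_NET_ADMIN` for the call to succeed, and that the helper name `try_enter_repair_mode` is made up for this example. A real migration would go on to save and restore sequence numbers and queue contents, which this does not attempt.

```python
import socket
import sys

# Assumption: value taken from <linux/tcp.h>; this sketch defines it
# itself rather than relying on the socket module to export it.
TCP_REPAIR = 19


def try_enter_repair_mode() -> bool:
    """Try to put a fresh TCP socket into repair mode.

    Returns True if the kernel accepted the option, and False on
    non-Linux systems or when the process lacks the privilege.
    """
    if not sys.platform.startswith("linux"):
        return False
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
        # Leave repair mode again before closing the socket.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)
        return True
    except OSError:
        # Typically EPERM when running without CAP_NET_ADMIN.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    print("repair mode available:", try_enter_repair_mode())
```

Run as an unprivileged user, this will most likely report that repair mode is unavailable, which is expected.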

The SSP story (linked here two issues ago) is getting even more interesting. They apparently decided not to switch to their secondary datacenter in order to avoid losing up to fifteen minutes’ worth of data, instead taking a week+ outage.

In SRE, we generally don’t have to worry about our deploys literally blowing up in our faces and killing us, but I find it valuable to look to other fields and learn from how they manage risk. Here’s an article about a tragic accident at UCLA in which a chemistry graduate student was severely injured and later died. A PhD chemist I know mentioned to me that the culture of safety in academia is much less rigorous than in industry, perhaps due in part to a differing regulatory environment.

Outages

Updated: September 25, 2016 — 10:11 pm
A production of Tinker Tinker Tinker, LLC