SRE Weekly Issue #41

Articles

Trestus is a new tool to generate a status page from a Trello board. Neat idea!

Your card can include markdown like any other Trello card and that will be converted to HTML on the generated status page, and any comments to the card will show up as updates to the status (and yes, markdown works in these too).

Writing Your First Postmortem

An excellent intro to writing post-incident analysis documents is the subject of this issue of Production Ready by Mathias Lafeldt. I can’t wait for the sequel in which he’ll address root causes.

The Morning Paper on Operability

Adrian Colyer of The Morning Paper gave a talk at Operability.IO with a round-up of his favorite write-ups of operations-related papers. I really love the fascinating trend of “I have no idea what I’m doing” — tools that help us infer interconnections, causality, and root causes in our increasingly complex infrastructures. Rather than try (and in my experience, usually fail) to document our massively complicated infrastructures in the face of increasing employee turnover rates, let’s just accept that this is impossible and write tools to help us understand our systems.

Tweets

And for fun, a couple of amusing tweets I came across this week:

Me: oh sorry, I got paged
Date: are you a doctor?
Me: uh
Nagios: holy SHIT this cert expires in SIXTY DAYS
Me: …yes

— Alice Goldfuss (@alicegoldfuss) (check out her awesome talk at SRECon16 about the Incident Command System)

We just accidentally nuked all our auto-scaling stuff and everything shutdown. We’re evidently #serverless now.

— Honest Status Page (@honest_update)

@mipsytipsy @ceejbot imagine you didn’t know anything about dentistry and decided we don’t need to brush our teeth any more. That’s NoOps.

— Senior Oops Engineer (@ReinH)

Zuul 2: The Netflix Journey to Asynchronous, Non-Blocking Systems

Netflix documents the new version of their frontend gateway system, Zuul 2. They moved from blocking IO to async, which allows them to handle persistent connections from clients and better withstand retry storms and other spikes.

The advantages of async systems sound glorious, but the above benefits come at a cost to operations. […] It is difficult to follow a request as events and callbacks are processed, and the tools to help with debugging this are sorely lacking in this area.
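To make the blocking-versus-async distinction concrete, here is a minimal sketch in Python's asyncio. It is not Netflix's code and says nothing about how Zuul 2 is actually implemented; the names `handle_request`, `serve_all`, and the `stats` dictionary are hypothetical, and `asyncio.sleep` stands in for a slow client or backend. It only illustrates the general idea that a single event-loop thread can hold many waiting connections open at once.

```python
import asyncio
import threading


async def handle_request(request_id: int, delay: float, stats: dict) -> int:
    """Simulate one client request that spends `delay` seconds waiting on I/O."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    stats["threads"].add(threading.get_ident())
    # Awaiting hands control back to the event loop instead of blocking a thread,
    # so other requests can make progress while this one waits.
    await asyncio.sleep(delay)
    stats["in_flight"] -= 1
    return request_id


async def serve_all(n_requests: int, delay: float) -> tuple[list[int], dict]:
    """Run all simulated requests concurrently on one event loop."""
    stats = {"in_flight": 0, "peak": 0, "threads": set()}
    results = await asyncio.gather(
        *(handle_request(i, delay, stats) for i in range(n_requests))
    )
    return results, stats


if __name__ == "__main__":
    results, stats = asyncio.run(serve_all(1000, 0.05))
    # Requests completed, peak number waiting at once, number of threads used.
    print(len(results), stats["peak"], len(stats["threads"]))
```

A thread-per-connection server would need one thread for each of those waiting requests. The trade-off quoted above shows up even in this toy: each request's work is split across the `await`, so a stack trace taken mid-request no longer shows the whole path that request took.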

Uber hits bumps in the road with microservices challenges

In last week’s issue, I linked to a chapter from Susan Fowler’s upcoming book on microservices. Here’s an article summarizing her recent talk at Velocity about the same subject: how to make microservices operable. She should know: Uber runs over 1300 microservices. Also summarized is her fellow SRE Tom Croucher’s keynote talk about outages at Uber.

Introducing the GitHub Load Balancer

In this first of a series, GitHub lays out the design of their new load balancing solution. It’s pretty interesting due to a key constraint: git clones of huge repositories can’t resume if the connection is dropped, so they need to avoid losing connections whenever possible.

Book Review: Site Reliability Engineering – How Google Runs Production Systems

I’m embarrassed to say that I haven’t yet found the time to take my copy of the SRE book from its resting place on my shelf, but here’s another review with a good amount of detail on the highlights of the book.

TCP connection repair

Live migration of VMs while maintaining TCP connections makes sense — the guest’s kernel holds all the connection state. But how about live migrating containers? The answer is a Linux feature called TCP connection repair.

SSP accused of making ‘wrong call’ over decision not to use secondary data centre after outage

The SSP story (linked here two issues ago) is getting even more interesting. They apparently decided not to switch to their secondary datacenter in order to avoid losing up to fifteen minutes’ worth of data, instead accepting an outage of more than a week.

Learning From UCLA

While, in SRE, we generally don’t have to worry about our deploys literally blowing up in our faces and killing us, I find it valuable to look to other fields to learn from how they manage risk. Here’s an article about a tragic accident at UCLA in which a chemistry graduate student was severely injured and later died. A PhD chemist I know mentioned to me that the culture of safety in academia is much less rigorous than in industry, perhaps due in part to a differing regulatory environment.

Outages

Destiny (game)
Pokemon GO
- Not to be confused with poke Mongo.
Pingdom
- An outage in both the admin/API and actual monitoring.
ASX (Australian Securities Exchange)
Global Switch (London datacenter)
Phoenix (civil service pay system)
- The system used by Canada to pay its civil servants went down the day before payday.
T-Mobile US
Fonality
