SRE Weekly Issue #175

A message from our sponsor, VictorOps:

Looking to go serverless? Beau Christensen, VictorOps Director of Platform Engineering, and Tom McLaughlin, Founder of ServerlessOps, sat down to talk about when VictorOps decided to venture into AWS:

http://try.victorops.com/SREWeekly/going-serverless

Articles

This and other enlightened reflections on incident reviews can be found in this article:

Many organizations have driven post-incident reviews to become pallid, vapid, mechanical exercises whose value is limited to producing a defensible argument that management is occurring.

Richard Cook — Adaptive Capacity Labs

In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.

Jeff Jo — Stripe

“Multi-cloud” never really lived up to its hype, did it? This article argues that a multi-cloud strategy is only useful in specific, constrained situations.

Disco Posse

I love how they used idempotency to avoid downtime and missed or repeated transactions during the cutover.

Miguel Carranza — RevenueCat
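
For readers who haven't met the idea, here is a minimal sketch of why idempotency helps during a cutover. It is not RevenueCat's implementation: the `Ledger` class, the `txn-…` keys, and the in-memory dictionary are all hypothetical stand-ins for illustration. The point is only that if every transaction carries a unique key, replaying the same events from both the old and new pipelines leaves the final state unchanged.

```python
class Ledger:
    """Toy in-memory ledger (hypothetical); a real system would use a durable store."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> balance after the first application

    def apply(self, key, amount):
        # A replayed key returns the original result and changes nothing.
        if key in self._seen:
            return self._seen[key]
        self.balance += amount
        self._seen[key] = self.balance
        return self.balance


ledger = Ledger()
events = [("txn-1", 10), ("txn-2", 5)]

# First delivery, e.g. from the old pipeline.
for key, amount in events:
    ledger.apply(key, amount)

# Duplicate delivery, e.g. from the new pipeline during the cutover.
for key, amount in events:
    ledger.apply(key, amount)

print(ledger.balance)
```

Because duplicates are absorbed, the two pipelines can overlap for a while instead of requiring a hard stop between them.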

This is either really clever or just unsporting.

Tonya Garcia — MarketWatch

This article discusses six kinds of SRE team (“kitchen sink”, infrastructure, tools, product/application, embedded, and consulting) and the pros and cons of each.

Gustavo Franco and Matt Brown — Google

If you see half the incidents this quarter compared to last, does it actually mean anything, statistically speaking? The math in this article applies equally well to SRE, and casts a shadow on the idea of tracking “metrics” like MTTR.

Marloes Nitert — Safety Differently
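
To make the question concrete, here is a rough sketch of one way to check it. It assumes incidents arrive independently at a constant rate (a Poisson model), which is an assumption made here and not necessarily the article's approach; the counts of 10 and 5 are made-up numbers. Under that model, if the underlying rate did not change, each incident is equally likely to land in either quarter, so the count in the second quarter follows a Binomial(n, 0.5) distribution.

```python
from math import comb


def one_sided_p(prev, curr):
    """Probability of seeing `curr` or fewer incidents in the second quarter,
    given prev + curr incidents in total and no real change in rate."""
    n = prev + curr
    return sum(comb(n, k) for k in range(curr + 1)) / 2 ** n


# Hypothetical counts: 10 incidents last quarter, 5 this quarter.
p = one_sided_p(10, 5)
print(round(p, 3))
```

With these numbers the probability comes out at about 0.15, so a drop from 10 to 5 is well within what chance alone would produce under this model.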

This field guide to debugging is the synthesis of a bunch of contributions by folks on Twitter, forged into an article by the inimitable Julia Evans. Maybe a zine is in the works?

Julia Evans

Outages

Updated: June 30, 2019 — 8:26 pm
A production of Tinker Tinker Tinker, LLC