SRE Weekly Issue #197

It’s been four years since I started SRE Weekly.  I’m having a ton of fun and learning a lot, and I can’t tell you all how happy it makes me that you read the newsletter.

A huge thank you to everyone who writes amazing SRE content every week.  Without you folks, SRE Weekly would be nothing.  Thanks also to everyone who sends links in — I definitely don’t catch every interesting article!

A message from our sponsor, VictorOps:

From everyone at VictorOps, we wanted to wish you a happy holiday season and give thanks for this SRE community. So, we put together this fun post to highlight the highs and lows of being on-call during the holidays.

https://go.victorops.com/sreweekly-on-call-holidays

Articles

Here’s an intro to the Learning From Incidents community. I can’t wait to see what these folks write. They’re coming out of the gate fast, with a post every day for the first week.

Nora Jones

In order to understand how things went wrong, we need to first understand how they went right

I love the move toward using the term “operational surprise” rather than “incident”.

Lorin Hochstein

Fascinating detail about the space shuttle Columbia’s accident, and the confusing jargon at NASA that may have contributed.

Dwayne A. Day — The Space Review

Google released free material (slides, handbooks, worksheets) to help you run a workshop on effective SLOs.

Lots of really interesting detail about how LinkedIn routes traffic to datacenters and what happens when a datacenter goes down.

Nishant Singh — LinkedIn

Our field is learning a ton, and it can be tempting to short-circuit that learning.  It takes time to really grok and integrate what we’re learning.

Now it may be easy to accept all of this and think “Yeah yeah, I got it. Let me at that ‘resilience’. I’m going to ‘add so much resilience’ to my system!”.

Will Gallego

I like the distinction between “unmanaged” and “untrained” incident response.

Jesus Climent — Google

This chronicle of learning about observability makes an excellent reading list for those just diving in.

Mads Hartmann

Outages

Updated: December 8, 2019 — 9:20 pm
A production of Tinker Tinker Tinker, LLC