SRE Weekly Issue #158

A message from our sponsor, VictorOps:

The golden signals of SRE and monitoring help identify a great starting point for teams looking to proactively build reliability into highly integrated applications and services.

http://try.victorops.com/sreweekly/sre-golden-signals

Articles

This air traffic accident analysis is chilling to listen to, and also incredibly educational. As you listen through the conversation, it becomes more and more clear that the pilot is suffering from information overload. An Incident Commander would be wise to remember the lessons learned here.

After listening to the above recording, I got hooked and kept listening to more and more case studies. Here’s another enlightening one: Real Pilot Story: From Miscue to Rescue

US Air Safety Institute

PagerDuty is quickly approaching Etsy’s level of awesome incident-related articles and guides.

Rachael Byrne — PagerDuty

Retiring features and products can often be harder to do safely than deploying them in the first place.

Rachana Kumar – Etsy

Do your SLIs measure what really matters to your customers? This article discusses how to find out and what to do if they don’t.

Adrian Hilton and Yaniv Aknin — Google

Outages

Updated: February 3, 2019 — 8:19 pm
A production of Tinker Tinker Tinker, LLC