SRE Weekly Issue #408

A message from our sponsor, FireHydrant:

It’s time for a new world of alerting tools that prioritize engineer well-being and efficiency. The future lies in intelligent systems that are compatible with real life and use conditional rules to adapt and refine thresholds, reducing alert fatigue.
https://firehydrant.com/blog/the-alert-fatigue-dilemma-a-call-for-change-in-how-we-manage-on-call/

This is either a set of SRE interview topics or the squares for the SRE bingo card.

  Lorin Hochstein

Blame awareness only works if you work towards blame awareness with all incidents, not just the ones that affect you.

  Will Gallego

A brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses.

  Liwei Guo, Anush Moorthy, Li-Heng Chen, Vinicius Carvalho, Aditya Mavlankar, Agata Opalach, Adithya Prakash, Kyle Swanson, Jessica Tweneboah, Subbu Venkatrav, Lishan Zhu — Netflix

Here are five concrete tips to fix your alerts and reduce alert fatigue.

  Candace Shamieh, Daljeet Sandu, and Nicolas Narbais — Datadog

This article contains guidelines for many kinds of reviews and activities SRE can do to improve reliability, such as SLO reviews, dependency reviews, and more.

  Jamie Allen

However, the reality of alerting in a socio-technical system must cater not only to the mess around the signal, but also to the longer term interpretation of alerts by people and automation acting on them. This post will expand on this messiness and why Honeycomb favors an iterative approach to setting our alerts.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

This far-ranging conversation covers many aspects of developing a reliable platform for engineering. There’s a text summary if audio’s not your thing.

  Ash Patel — SREPath

Spurred by a single-AZ outage that took down their service, Slack set out to break their system into isolated segments so that an AZ can be drained of traffic quickly and without impacting customers.

  Cooper Bethea — Slack

Updated: January 21, 2024 — 10:05 pm
A production of Tinker Tinker Tinker, LLC