This is either a set of SRE interview topics or the squares for the SRE bingo card.
Blame awareness only works if you practice it with all incidents, not just the ones that affect you.
a brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses.
Liwei Guo, Anush Moorthy, Li-Heng Chen, Vinicius Carvalho, Aditya Mavlankar, Agata Opalach, Adithya Prakash, Kyle Swanson, Jessica Tweneboah, Subbu Venkatrav, Lishan Zhu — Netflix
Here are five concrete tips to fix your alerts and reduce alert fatigue.
Candace Shamieh, Daljeet Sandu, and Nicolas Narbais — Datadog
This article contains guidelines for many kinds of reviews and activities SRE can do to improve reliability, such as SLO reviews, dependency reviews, and more.
However, the reality of alerting in a socio-technical system must cater not only to the mess around the signal, but also to the longer-term interpretation of alerts by people and automation acting on them. This post will expand on this messiness and why Honeycomb favors an iterative approach to setting our alerts.
Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.
This far-ranging conversation covers many aspects of developing a reliable platform for engineering. There’s a text summary if audio’s not your thing.
Ash Patel — SREPath
Spurred by a single-AZ outage that took down their service, Slack set out to break their system into isolated segments so that an AZ can be drained of traffic quickly and without impacting customers.
Cooper Bethea — Slack