SRE Weekly Issue #400

A message from our sponsor, FireHydrant:

How is FireHydrant building its alerting tool, Signals, to be robust, lightning-fast, and configurable to how YOU work? In this edition of their Captain’s Log, they dive into CEL and how they’re using it to handle routing and logic.
https://firehydrant.com/blog/captains-log-how-were-leveraging-cel/

The network is not reliable. What are the implications and what can we do about it?

  Anadi Misra

Beyond a run-of-the-mill severity levels article, this one goes into a couple of common pitfalls.

  Jonathan Word

Some good tips in here, esp. the one about brevity.

  Ashley Sawatsky — Rootly

Subtitle:

Or, Eleven things we have learned as Site Reliability Engineers at Google

   Adrienne Walcer, Kavita Guliani, Mikel Ward, Sunny Hsiao, and Vrai Stacey — Google

Good lessons to learn here that apply more broadly than just EKS.

  Christian Alexánder Polanco Valdez — Adevinta

This article is about project management, but a lot of the skills discussed apply to aspects of SRE at Staff+ levels.

  Sannie Lee — Thoughtworks (via martinfowler.com)

Now this is more like it: there’s a healthy dose of skepticism woven through this article, including things genAI probably won’t be good for, and potential pitfalls.

  Jesse Robbins — Heavybit

There are two different ways of alerting on SLOs, for two very different audiences, as explained in this article. Ostensibly this is a product feature announcement, but you don’t need to be using the product to get a lot out of this.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

Updated: November 26, 2023 — 9:13 pm
A production of Tinker Tinker Tinker, LLC