SRE Weekly Issue #400

A message from our sponsor, FireHydrant:

How is FireHydrant building its alerting tool, Signals, to be robust, lightning-fast, and configurable to how YOU work? In this edition of their Captain’s Log, they dive into CEL and how they’re using it to handle routing and logic.

The network is not reliable. What are the implications and what can we do about it?

  Anadi Misra

Beyond a run-of-the-mill severity levels article, this one goes into a couple of common pitfalls.

  Jonathan Word

Some good tips in here, esp. the one about brevity.

  Ashley Sawatsky — Rootly


Or, Eleven things we have learned as Site Reliability Engineers at Google

   Adrienne Walcer, Kavita Guliani, Mikel Ward, Sunny Hsiao, and Vrai Stacey — Google

Good lessons to learn here that apply more broadly than just EKS.

  Christian Alexánder Polanco Valdez — Adevinta

This article is about project management, but a lot of the skills discussed apply to aspects of SRE at Staff+ levels.

  Sannie Lee — Thoughtworks (via

Now this is more like it: there’s a healthy dose of skepticism woven through this article, including things genAI probably won’t be good for, and potential pitfalls.

  Jesse Robbins — Heavybit

There are two different ways of alerting on SLOs, for two very different audiences, as explained in this article. Ostensibly this is a product feature announcement, but you don’t need to be using the product to get a lot out of this.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

Updated: November 26, 2023 — 9:13 pm
A production of Tinker Tinker Tinker, LLC