SRE Weekly Issue #427

A message from our sponsor, FireHydrant:

We’ve gone all out on our new integration with Microsoft Teams. If you’re an MS Teams user, FireHydrant now supports the most comprehensive integration for incident management. Run the entire incident management process without ever leaving the chat.

Written by a GitHub employee, this article seeks to answer the titular question, with discussions of noise reduction concerns and incidents that affect only a subset of customers.

  Ross Brodbeck

Wow, this incident is a really great example of the idea that there is no one single root cause.


Understand the safeguard configuration of ArgoCD’s ApplicationSet through the experience of our SRE, who learned it from an incident.

  Tanat Lokejaroenlarb — Adevinta

Sometimes it’s better to do something in multiple passes, even if it’s less efficient. This applies to individual programs and major deployments alike.

  Thomas A. Limoncelli — ACM Queue

Another thought-provoking take on the argument that there is no one root cause.

  Lorin Hochstein

I referenced this at work the other day, but the interesting bit is that the pod-eviction-timeout option was removed in Kubernetes 1.27, and I’ve had difficulty finding out what replaced it.

  Bhargav Bhikkaji

How to use Llama 2 7B to generate summaries of your incidents, using Cloudflare Workers and Workers AI.

It’s a complete how-to using an open source LLM.

  Karl Stoney

Here’s a great incident writeup from last December that I came across this week.

By the way, if you see or write an incident followup post, I’d be grateful if you sent a link my way!


Updated: June 2, 2024 — 9:46 pm
A production of Tinker Tinker Tinker, LLC