Written by a GitHub employee, this article seeks to answer the titular question, with discussions of noise reduction concerns and incidents that affect only a subset of customers.
Ross Brodbeck
Wow, this incident is a really great example of the idea that there is no one single root cause.
Understand the safeguard configuration of ArgoCD’s ApplicationSet through the experience of our SRE, who learned about it from an incident.
Tanat Lokejaroenlarb — Adevinta
Sometimes it’s better to do something in multiple passes, even if it’s less efficient. This applies to individual programs and major deployments alike.
Thomas A. Limoncelli — ACM Queue
Another thought-provoking take on the argument that there is no one root cause.
Lorin Hochstein
I referenced this at work the other day, but the interesting bit is that the pod-eviction-timeout option was removed in Kubernetes 1.27, and I’ve had difficulty finding out what replaced it.
Bhargav Bhikkaji
How to use Llama 2 7B to generate summaries of your incidents, using Cloudflare Workers and Workers AI.
It’s a complete how-to using an open source LLM.
Karl Stoney
Here’s a great incident writeup from last December that I came across this week.
By the way, if you see or write an incident followup post, I’d be grateful if you sent a link my way!
Turso