If you focus too narrowly on preventing the specific details of the last incident, you’ll fail to identify the more general patterns that will enable your future incidents.
Lorin Hochstein
An interesting thought: scaffolding our software systems to make them more robust might actually hamper our sociotechnical system’s overall resilience. I love the horticultural analogy.
Stuart Rimell — Uptime Labs
As LLM services become more prevalent, traditional infrastructure metrics like availability and latency are no longer sufficient on their own to measure reliability. What should we use instead?
T-sato — Mercari
Here’s a primer on chaos testing in Kubernetes, including a tutorial on using CNCF’s LitmusChaos tool to perform chaos experiments in your cluster. It’s more than just a tutorial, because it covers theoretical topics like chaos testing anti-patterns.
Josephine Eskaline Joyce — DZone
The problem space seems simple, but the theme here is scale: simple solutions just don’t work in an infrastructure the size of Datadog’s.
Gabriel Reid — Datadog
This second installment focuses on operational complexity and strategic decision-making for large-scale initiatives. The article covers when to use formal programs versus working groups, how to leverage prioritization to reduce operational burden, and strategies for phased rollouts that balance technical complexity with agility.
Konstantin Rohleder — HelloFresh
This article challenges the assumption that popular DevOps practices are universally beneficial, arguing that teams should evaluate whether practices like Kubernetes, SLOs, or GitOps actually solve their specific problems rather than adopting them because “everyone else does.”
Tom Elliott — The Friday Deploy
This short post covers:
* Why does this distinction matter?
* An illustration to build a memorable base
* Quotes from Google’s books
Alex Ewerlöf