SRE Weekly Issue #444

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

When you’re doing something 60 million times per second, even a modest optimization makes a huge difference.

  Kevin Guthrie — Cloudflare

Meet Pushy, Netflix’s websocket-based push system with an impressive five nines of reliability in message delivery.

  Karthik Yagna, Baskar Odayarkoil, and Alex Ellis — Netflix

If your early-stage startup can’t afford an observability solution from a vendor, you could try rolling your own. This article has an overview and pointers but stops short of explicit instructions.

  Malay Hazarika — Osuite

With AI SRE “agents” cropping up everywhere, what should we think? Here’s an overview of what’s going on with links to read more.

  Clay Smith — Monitoring Monitoring

An overview of the two kinds of RabbitMQ queues along with performance numbers from load tests.

  Josephine Eskaline Joyce and Anilkumar Mallakkanavar — DZone

In this blog post, I’ll discuss the evolution of our Chef infrastructure over the years and the challenges we encountered along the way.

  Archie Gunasekara — Slack

Using LLMs to generate test cases to test an AI agent’s ability to diagnose Kubernetes problems, with a kubectl simulator running on an LLM. Whew, that’s a lot of AI!

  Jeffrey Tsaw — Parity

I was having some major FOMO last week, so this recap of the SEV0 incident management conference is especially welcome.

  Amin Astaneh — Certo Modo

Updated: September 29, 2024 — 8:39 pm
A production of Tinker Tinker Tinker, LLC