SRE Weekly Issue #441

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

This post aims to shed some light on why we migrated to Prometheus, as well as outline some of the technical challenges we faced during the process.

  Eddie Bracho — Mixpanel

Amazon posted this thorough summary of a multi-service outage at the end of July. The impact stems from a complex distributed system failure in Kinesis.

  Amazon

This team shows what they did to ferret out and eliminate occurrences of N+1 DB queries triggered by a single request in their Django app.

  Gonzalo Lopez — Mixpanel

The folks at incident.io share about how they baked observability into the infrastructure for their new on-call tool.

Note for folks using screen readers: there’s a picture without alt-text that contains the following important text:

  1. Overview dashboard
  2. System dashboard
  3. Logs
  4. Tracing

It’s right after this sentence:

Those pieces fit together something like this:

  Martha Lambert — incident.io

An overview of DST, which was a new concept for me. It’s about running simulations to try to find faults in a distributed system.

  Phil Eaton

If you build software that people depend on and are not operationally responsible for it (particularly on-call): you should be. 🛑

I like the way this one draws from the author’s experience, plus the emphasis on feedback loops.

  Amin Astaneh

Retries help increase service availability. However, if not done right, they can have a devastating impact on the service and elongate recovery time.

   Rajesh Pandey

Keepalive pings are critical in any system that uses TCP, since connections can hang at any point. I’ve been meaning to write this one for years!

  Lex Neva — Honeycomb

  Full disclosure: Honeycomb is my employer.

Updated: September 8, 2024 — 11:08 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme