SRE Weekly Issue #466

A bit of a short issue this week, as I spent most of my weekend at my child’s first FIRST Robotics Competition of the season. FRC truly is a microcosm of reliability engineering: teams balance limited time and resources while trying to produce the most reliable bot possible.

A message from our sponsor, incident.io:

What does “good” incident management look like? MTTx metrics track speed, but speed alone doesn’t mean success. We analyzed 100,000+ incidents from companies of all sizes to identify benchmarks for every stage of the incident lifecycle. See how your team stacks up.

https://go.incident.io/good-incident-management-report

Just because Google, Amazon, or Facebook does it doesn’t mean you should. Here are four ‘best practices’ of the hyperscalers you have permission to ignore.

  Matt Asay — InfoWorld

An introduction to distributed transactions using the Saga pattern, including pros and cons and two approaches for implementing sagas.

  Sid — Scalable Thread

Here’s an argument against real-world “war rooms” for incident response, including a great incident story as an example.

I can’t imagine doing that kind of multi-window parallel investigation stuff on a teeny little laptop screen with people right next to me on either side

  rachelbythebay

This one includes a list of responsibilities a lead incident responder has and another list of things they should delegate.

Incident lead isn’t an extra job that you do “on top of” engineering. It’s the main job.

  r/devoopseng — Reddit r/sre

Scaling Elasticsearch requires balancing sharding, query performance, and memory tuning for optimal efficiency in high-traffic, real-time applications.

  Vivek Kumar — DZone

Updated: March 2, 2025 — 9:56 pm
A production of Tinker Tinker Tinker, LLC