SRE Weekly Issue #487

A message from our sponsor, Spacelift:

IaC Experts! IaCConf Call for Presenters – August 27, 2025
The upcoming IaCConf Spotlight dives into the security and governance challenges of managing infrastructure as code at scale. From embedding security in your pipelines to navigating the realities of open source risk, this event brings together practitioners who are taking a security-minded approach to how they implement IaC in their organization.

Call for Presenters is now open until Friday, August 1. Submit your CFP or register for the free event today.

Join the Free Virtual Event

Pinterest decided to replace their Hadoop+Spark-based data processing pipeline with one based on Kubernetes.

In part one, we provide rationale for our new technical direction prior to outlining the overall design and detailing the application focused layer of our platform. We conclude with current status and some of our learnings.

  Soam Acharya, Rainie Li, William Tom, and Ang Zhang — Pinterest

This article raises some important concerns that are worth thinking about.

It’s fast and feels efficient, but it masks a drop in codebase familiarity. Over time, your top engineers stop being system experts.

  Alexander Procter — Okoone

I really love the care taken in this article to consider the potential risks of AI tools for incident response. There are many valuable insights that make this article way more than just a sales pitch for their tool.

  Chris Evans — incident.io

Quicksilver is a globally distributed key-value store serving billions of requests per second where speed is critical, so you know the scaling challenges are going to be interesting.

  Marten van de Sanden and Anton Dort-Golts — Cloudflare

This article gives reproducible cases in which MySQL and Postgres can reuse auto-increment IDs.

I think I’ve seen this advice violated at nearly every company I’ve worked at:

Best practice dictates that you shouldn’t be using IDs from database tables outside of that table unless it’s some foreign key field

  Sam Rose
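
To get a feel for what ID reuse looks like, here is a minimal sketch that uses SQLite from Python's standard library as a stand-in. It is not one of the article's MySQL or Postgres cases. The `users` table and the helper name are made up for illustration. SQLite's plain `INTEGER PRIMARY KEY` picks one more than the current largest rowid, so deleting the newest row frees its ID for the next insert, while the `AUTOINCREMENT` keyword prevents that.

```python
import sqlite3


def next_id_after_deleting_latest(autoincrement: bool) -> int:
    """Insert three rows, delete the newest, insert again, and return the new ID.

    Hypothetical helper for illustration only; uses an in-memory SQLite database.
    """
    conn = sqlite3.connect(":memory:")
    pk = "INTEGER PRIMARY KEY AUTOINCREMENT" if autoincrement else "INTEGER PRIMARY KEY"
    conn.execute(f"CREATE TABLE users (id {pk}, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [("ann",), ("bob",), ("cat",)],
    )
    # Remove the row that currently holds the highest ID (3).
    conn.execute("DELETE FROM users WHERE id = 3")
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", ("dan",))
    new_id = cur.lastrowid
    conn.close()
    return new_id


reused = next_id_after_deleting_latest(autoincrement=False)
fresh = next_id_after_deleting_latest(autoincrement=True)
print(reused, fresh)
```

If ID 3 had already been handed to another system (a URL, a log line, a cache key), the new row would silently inherit those references, which is the risk the quoted advice is guarding against.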

Here’s a great explanation of why it’s often better to use for_each instead of count in Terraform.

  Ned Bellavance
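
One common reason, sketched here in Python rather than Terraform itself: count addresses instances by list position, while for_each addresses them by key. The resource name `example_resource.server` and the helper functions below are hypothetical and only model how addresses are assigned; they are not taken from the linked article.

```python
def addresses_with_count(names):
    """Model count: each instance is addressed by its position in the list."""
    return {f"example_resource.server[{i}]": name for i, name in enumerate(names)}


def addresses_with_for_each(names):
    """Model for_each: each instance is addressed by its own key."""
    return {f'example_resource.server["{name}"]': name for name in names}


def changed_addresses(before, after):
    """Return addresses that are created, destroyed, or point at a different item."""
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))


before = ["web", "api", "worker"]
after = ["web", "worker"]  # "api" removed from the middle of the list

count_diff = changed_addresses(
    addresses_with_count(before), addresses_with_count(after)
)
for_each_diff = changed_addresses(
    addresses_with_for_each(before), addresses_with_for_each(after)
)

print(count_diff)
print(for_each_diff)
```

In this model, removing the middle item touches two positional addresses under count (index 1 now refers to a different item and index 2 disappears), but only the single "api" address under for_each.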

This debugging story really drew me in. It’s so incredibly satisfying the way their initial theory was confirmed so tidily in the end.

  Nayef Ghattas — Datadog

In our latest Rootly roundtable, we sat down with a group of seasoned SREs (collectively packing over 100 years of ops scars) to trade notes on what makes an alert useful, what makes it noise, and how to build alerting systems that teams can trust.

Here are their top strategies distilled for you:

  Jorge Lainfiesta — Rootly

Updated: July 27, 2025 — 10:53 pm
A production of Tinker Tinker Tinker, LLC