SRE Weekly Issue #471

A message from our sponsor, incident.io:

We’re building an AI agent that investigates incidents with you—diagnosing the problem and even fixing it. Go behind the scenes with the incident.io engineers rethinking what’s possible with AI, one ambitious idea (and bug) at a time.

https://go.incident.io/building-with-ai

The author of this one draws a line between their two interests, formal methods and resilience engineering, and I’m so here for it.

  Lorin Hochstein

In this part of the Scaling Nextdoor’s Datastores blog series, we’ll explore how the Core-Services team at Nextdoor serializes database data for caching while ensuring forward and backward compatibility between the cache and application code.

  Ronak Shah — Nextdoor
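
To make the compatibility problem concrete, here is a minimal sketch of one common approach: tag each cached payload with a schema version and tolerate missing or unknown fields on read. This is an illustration under my own assumptions, not Nextdoor's implementation; the field names, the version number, and the JSON envelope are all hypothetical.

```python
import json
from typing import Optional

SCHEMA_VERSION = 2  # hypothetical version of the cached record layout

# Hypothetical fields, with defaults for entries written by older code.
DEFAULTS = {"id": None, "name": "", "locale": "en"}


def serialize(record: dict) -> str:
    """Wrap the record in an envelope that carries the schema version."""
    return json.dumps({"v": SCHEMA_VERSION, "data": record})


def deserialize(payload: str) -> Optional[dict]:
    """Return a record, or None to signal that the entry should be treated as a cache miss."""
    envelope = json.loads(payload)
    # Written by newer code than this reader understands: fall back to the database.
    if envelope.get("v", 0) > SCHEMA_VERSION:
        return None
    data = envelope.get("data", {})
    # Keep only known fields; fill in defaults for fields older writers did not set.
    return {key: data.get(key, default) for key, default in DEFAULTS.items()}
```

Treating newer-versioned entries as misses is only one possible policy; it trades some cache hit rate during a deploy for not having to reason about fields the reader has never seen.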

MySQL’s ALTER TABLE INPLACE has limitations and downsides, and INSTANT does too, as explained in this article.

  Shlomi Noach — PlanetScale

If you have multiple different types of work in your system, a queue per type of work may be a good choice.

Bonus(?): includes a bathroom-based analogy.

  Marc Brooker
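
For a feel of what "a queue per type of work" can look like, here is a minimal sketch that keeps one FIFO queue per work type and drains them round-robin, so a backlog of one type cannot sit in front of another. The class name and the round-robin policy are my own assumptions for illustration; the article may well recommend a different arrangement, such as separate worker pools.

```python
from collections import deque


class PerTypeQueues:
    """One FIFO queue per work type, drained round-robin (hypothetical sketch)."""

    def __init__(self):
        self._queues = {}      # work type -> deque of pending items
        self._order = deque()  # rotation order of known work types

    def put(self, work_type, item):
        # Create the queue for a work type the first time it is seen.
        if work_type not in self._queues:
            self._queues[work_type] = deque()
            self._order.append(work_type)
        self._queues[work_type].append(item)

    def get(self):
        """Return (work_type, item) from the next non-empty queue, or None if all are empty."""
        for _ in range(len(self._order)):
            work_type = self._order[0]
            self._order.rotate(-1)  # move this type to the back of the rotation
            queue = self._queues[work_type]
            if queue:
                return work_type, queue.popleft()
        return None
```

With two "slow" items enqueued ahead of one "fast" item, successive `get()` calls alternate between the types instead of making the fast item wait behind both slow ones.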

One Lambda function per URL path? Or a monolithic function that handles multiple paths? There are benefits and drawbacks to each.

  Yan Cui
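
The two shapes being compared can be sketched as follows. This is a simplified illustration, not code from the article: the handler names and routes are hypothetical, and it assumes an API Gateway HTTP API event (payload format 2.0), where the request path arrives in `rawPath`.

```python
# Single-purpose style: each function would be deployed as its own Lambda.
def list_orders(event, context):
    return {"statusCode": 200, "body": "orders"}


def get_profile(event, context):
    return {"statusCode": 200, "body": "profile"}


# Monolithic style: one deployed function that routes internally by path.
ROUTES = {"/orders": list_orders, "/profile": get_profile}  # hypothetical routes


def monolith_handler(event, context):
    path = event.get("rawPath", "")  # assumes HTTP API payload format 2.0
    handler = ROUTES.get(path)
    if handler is None:
        return {"statusCode": 404, "body": "not found"}
    return handler(event, context)
```

In the single-purpose layout the routing table lives in the API gateway configuration instead; in the monolith it lives in code, which is where many of the tradeoffs come from.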

Published on April 1.

The truth is, many incidents move faster when there’s executive oversight — a sense of urgency, pressure, and someone repeatedly asking, “What’s the ETA?”

  Chris Evans — incident.io

  This article is published by my sponsor, incident.io, but their sponsorship did not influence its inclusion in this issue.

I’m seeing a lot of echoes of Bainbridge’s Ironies of Automation in this article about AIOps and AI tooling. If AI handles most coding and incidents, how will humans handle the outliers?

  Hamed Silatani — Uptime Labs

I wasn’t able to make it, so I really appreciate this recap. Sounds like SREcon was, unsurprisingly, heavily focused on AI this time around.

  Niall Murphy

Updated: April 6, 2025 — 9:23 pm
A production of Tinker Tinker Tinker, LLC