SRE Weekly Issue #438

Are there any blind or low-vision readers out there who would be willing to answer a few questions? I’m looking to learn more about your experience of reading a newsletter like this and the articles I link to. If you’re interested, please drop me an email at lex at sreweekly dot com. Thanks!

A message from our sponsor, FireHydrant:

Migrate off of PagerDuty, save money, and then have all of your configuration exported as Terraform modules? We did that. We know one of the hardest parts of leaving a legacy tool is the old configuration. That’s why we dedicated time to building the Signals migrator, which makes it easy to switch.

https://firehydrant.com/blog/speedrun-to-signals-automated-migrations-are-here/

This article shows how to use time_rotating and multirotate_set to regularly rotate credentials using Terraform.

  Andy Leap — Mixpanel

After an incident involving a database schema change, this engineer created a linting system for schema changes to catch painful ones that would cause a full table rewrite.

  Fred Hebert — Honeycomb

  Full disclosure: Honeycomb is my employer.

Finding Heroku and alternative services lacking for various reasons, these folks built their own Heroku-like platform on top of Kubernetes and migrated their service to it.

  Matheus Lichtnow — WorkOS

It’s anything but simple to handle IPv4 and IPv6 in your service. This article covers the nitty-gritty details including dual-stack resolvers and Happy Eyeballs.

  Viacheslav Biriukov

What’s great about an incident? It helps uncover latent flaws in your system, as happened to these folks during a Redis upgrade.

  Shayon Mukherjee

Tips on how to handle vendor incidents, from runbooks to incident management and post-incident review.

  Mandi Walls — PagerDuty

Cool trick:

[…] when an operational surprise happens, someone will remember “Oh yeah, I remember reading about something like this when incident XYZ happened”, and then they can go look up the incident writeup to incident XYZ and see the details that they need to help them respond.

  Lorin Hochstein

While the CAP theorem may be technically correct, the actual limitations it imposes on real-world systems have nuance.

The reality is that CAP is nearly irrelevant for almost all engineers building cloud-style distributed systems, and applications on the cloud.

  Marc Brooker

Updated: August 18, 2024 — 10:30 pm
A production of Tinker Tinker Tinker, LLC