SRE Weekly Issue #405

A message from our sponsor, FireHydrant:

In this episode of FireHydrant’s Gimme 5 video series, Asaf Gaon, Director of Technical Support for automated grocery fulfillment solution Takeoff Technologies, talks about how to handle third-party downtime in a collaborative – and automated – way. https://firehydrant.com/blog/gimme-5-with-takeoff-technologies-asaf-gaon/

Using the Swedish word “Lagom” as a jumping-off point, this article explains the importance of choosing an SLO that is just right: not too lax and not too strict.

  Alex Ewerlöf

A simple security change like ceasing to use IMDSv1 can involve profound risk and necessitate a major migration process.

  Archie Gunasekara — Slack

It can be all too easy to let a subset of your IT organization “handle” resiliency. If resilience is about an ability to adapt and respond to change, then it needs broad buy-in.

  Richard Gall — The New Stack

If any seemingly innocuous change can break our systems, what should we do?

  Lorin Hochstein

What exactly is “human error”?

  Steven Shorrock — Humanistic Systems

Knock recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.

They share a ton of details about how they did it.

  Brent Anderson — Knock

Why do doctors still use antiquated pagers? There’s a lot here that speaks to what it’s really like to operate in an on-call environment, and how to evaluate new tools.

  Fred Hebert

This article riffs on Murphy’s law, using anecdotes to explore various aspects of how things go wrong.

  Bertrand Florat

Updated: December 31, 2023 — 9:36 pm
A production of Tinker Tinker Tinker, LLC