SRE Weekly Issue #448

A message from our sponsor, FireHydrant:

Practice Makes Prepared: Why Every Minor System Hiccup Is Your Team’s Secret Training Ground.

https://firehydrant.com/blog/the-hidden-value-of-lower-severity-incidents/

They traded their monolith for microservices in a quest for scalability, but they got complexity along with it.

  Jennifer Riggins — The New Stack

Here’s a great summary of the difference between mutable and immutable infrastructure, including a detailed analysis of the pros and cons of each.

  Josephine Eskaline Joyce and Umar Ali — DZone

An introduction to incident severity and SEV1 incidents, along with how to respond to them, how to prevent them, and how to learn from them.

  Kate Bernacchi-Sass — incident.io

Long-running spans can be difficult to deal with, but fortunately Hazel Weakly is here with an explanation and some tips.

  Hazel Weakly — The New Stack

Here’s a debugging odyssey for a truly gnarly Jupyter Notebook problem that caused slowness in very specific (and seemingly unrelated) circumstances.

  Hechao Li and Marcelo Mayworm — Netflix

Beyond just “What went well?” in an incident writeup, Lorin urges examining our incidents to see what they can tell us about how work gets done and what adaptations people have made in our systems.

  Lorin Hochstein

A huge primer on wide events in observability: what they are, how to implement them, how to use them, and a ton of examples of the kinds of fields you might want to include in your events.

  Jeremy Morrell

  Full disclosure: Honeycomb, my employer, is mentioned.

The 2024 DORA Report is out, and the folks at Rootly have some thoughts on the interesting bits for SREs, including AI, platform engineering, and burnout.

  Jorge Lainfiesta — Rootly

Updated: October 27, 2024 — 9:14 pm
A production of Tinker Tinker Tinker, LLC