SRE Weekly Issue #465

A message from our sponsor, incident.io:

On-call shouldn’t be a constant source of stress. On Feb 26 at 1 PM EST, join us to hear from teams who’ve moved from PagerDuty to incident.io On-call—reducing noise, improving alerting, and making on-call less painful. Insights from engineers who’ve been there.

https://go.incident.io/events/migrating-from-pagerduty

An incident report from the vault, along with its accompanying blog post, involving a rare but serious kernel freeze on GCP.

  Jake Cooper — Railway

Let’s discuss logging – unstructured, structured and canonical log lines – what they are and what value they bring to your production systems.

This one walks through implementing a logging system in a sample project.

  Obakeng Mosadi
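
For readers new to the term, here is a rough sketch of a canonical log line: one structured line per request, built up as the request is handled and emitted once at the end. This is not taken from the linked article. The class name, field names, and the `handle_request` function are all hypothetical, chosen only for illustration.

```python
import json
import time


class CanonicalLogLine:
    """Collects fields during one request and renders them as a single JSON line.

    Hypothetical helper for illustration; not from the linked article.
    """

    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Later values overwrite earlier ones for the same key.
        self.fields.update(fields)

    def render(self):
        record = dict(self.fields)
        # Elapsed time since the line was created, in milliseconds.
        record["duration_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        return json.dumps(record, sort_keys=True)


def handle_request(path, user_id):
    # Hypothetical request handler: each step adds what it learned.
    line = CanonicalLogLine(event="http_request", path=path)
    line.add(user_id=user_id)           # e.g. after authentication
    line.add(status=200, db_queries=2)  # e.g. after the work is done
    return line.render()


if __name__ == "__main__":
    print(handle_request("/orders", "u_123"))
```

The point of the pattern is that a single line carries everything needed to query a request after the fact, rather than scattering the same facts across many unstructured messages.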

This article aims to answer one question: How can Redis be used as a primary database for complex applications that need to store data in multiple formats?

It covers persistence and scaling options, including Redis Enterprise’s built-in CRDTs.

  Mohammed Talib

In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or application itself, and whether it is worth monitoring at all.

  Oxana Kharitonova and Jesper Brouer — Cloudflare

This post discusses key preconditions for building resilience, including resources, flexibility, expertise, diversity, and coordination.

  Lorin Hochstein

So the main problem with blameful postmortems is not the blame. It’s the very idea that particular decisions can be categorically unsafe.

  u/devoopseng — Reddit r/sre

This may be the shortest article I’ve ever linked to here, but it’ll make you think.

  Dean Wilson

If you use SLOs at all levels in your system, a failure of a core part (like the DB) may page multiple teams. This article offers strategies to handle this better.

  Fred Hebert — Honeycomb

Updated: February 23, 2025 — 8:52 pm
A production of Tinker Tinker Tinker, LLC