SRE Weekly Issue #424

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

https://firehydrant.com/blog/ai-for-incident-management-is-here/

Here’s an ultra-practical guide to pushing for reliability investments at your company, formatted as a runbook with a set of specific steps.

  Ross Brodbeck

A neat dive into how Amazon’s MemoryDB composes multiple systems to create a redundant Redis-compatible data store.

  Marc Brooker

This article looks into the economic and psychological impact of a culture of blame.

  Lee Atchison — Blameless

It took me two read-throughs to fully get this one, and I’m really glad I did it.

If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

  Lorin Hochstein

To prevent dangerous deploy collisions, these folks wrote an open source tool to mediate who gets to deploy when.

  Andrew Kannan — Klaviyo

If you’ve never worked at a startup before, you may be over-estimating how much you need to learn and how quickly.

When all you have is early adopters, you’re in a more forgiving environment, including for reliability.

  Nicholas Yan — Graphite

Structured logging is great, but there can be pitfalls and gotchas.

  Oakley Hall

An intro to SLOs with useful formulas, from the creator of the SLO Calculator featured here a while back.

  Alex Ewerlöf

Updated: May 12, 2024 — 9:57 pm
A production of Tinker Tinker Tinker, LLC