SRE Weekly Issue #447

A message from our sponsor, FireHydrant:

If the entire team is on a Zoom bridge during an incident – how do you know what really happened and when? We added real-time Zoom/Google Meet transcripts to make sure your incident timeline has every detail.

https://firehydrant.com/ai/

There are quite a few pitfalls waiting for you if you try to implement SLOs for your mobile app. This article explains them and offers strategies for dealing with them.

   Virna Sekuj — The New Stack

Blamelessness in incident retrospectives can be a difficult concept to truly internalize. This article describes 3 common “failure modes”, that is, ways in which organizations struggle with blamelessness.

  Tom Elliott — The Friday Deploy

Cloudflare spends a lot of time thinking about cooling, and it’s fascinating. I didn’t realize that spinning a fan faster consumed so much more energy!

  Leslye Paniagua — Cloudflare
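
For a rough sense of why fan speed matters so much, here is a minimal sketch that assumes the idealized fan affinity law, under which power draw scales with the cube of rotational speed for a given fan and air density. The function name is made up for illustration, and the numbers are not taken from the Cloudflare article.

```python
def relative_fan_power(speed_ratio: float) -> float:
    """Return the power multiplier for a given speed multiplier.

    Assumes the idealized fan affinity law (power proportional to speed cubed);
    real fans and real chassis will deviate from this.
    """
    return speed_ratio ** 3


# Doubling the speed costs roughly eight times the power under this model.
print(relative_fan_power(2.0))
# Halving the speed cuts power to about one eighth.
print(relative_fan_power(0.5))
```

Under this assumption, even a modest reduction in fan speed yields an outsized power saving.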

Explore the pitfalls of creating too many microservices, along with their causes, their implications, and potential strategies for mitigation.

   Sumit Kumar — DZone

Netflix stores a truly obscene number of events, each of which has a timestamp and a set of key-value pairs. This article goes into a ton of detail on how they built their system.

  Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, and Joey Lynch — Netflix

A fun debugging story for a confusing crash bug, in which they found 6 other related bugs along the way.

  Brett Wines — Slack

My favorite one is about the principle “You Ain’t Gonna Need It”:

The flip side of YAGNI, however, is that at some point you might actually need it.

  Luc van Donkersgoed

When you create an index on multiple columns in Postgres, you’ll need to be sure that the order of the fields in the index allows it to be applied to your queries, as these folks learned.

  Jean-Mark Wright
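
To see why column order matters, here is a simplified model of a two-column index as a sorted list of tuples. It is only an illustration of the ordering idea, not how Postgres is implemented, and the table contents and function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical rows: (customer_id, created_day, order_id)
rows = [
    (2, 5, "o1"),
    (1, 9, "o2"),
    (1, 3, "o3"),
    (3, 5, "o4"),
    (2, 1, "o5"),
]

# Model an index on (customer_id, created_day) as entries sorted by that key.
index = sorted(rows)


def lookup_by_leading_column(customer_id):
    """Leading column known: matches form one contiguous range,
    so two binary searches find its boundaries."""
    lo = bisect_left(index, (customer_id,))
    hi = bisect_left(index, (customer_id + 1,))
    return [order_id for _, _, order_id in index[lo:hi]]


def lookup_by_second_column(created_day):
    """Only the second column known: matches are scattered through
    the sorted entries, so every entry has to be examined."""
    return [order_id for _, day, order_id in index if day == created_day]


print(lookup_by_leading_column(1))
print(lookup_by_second_column(5))
```

In this model, a query that filters on customer_id can jump straight to its range, while a query that filters only on created_day gets no help from the sort order. Putting the column your queries filter on first is what lets the index be applied efficiently.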

Updated: October 20, 2024 — 9:04 pm
A production of Tinker Tinker Tinker, LLC