SRE Weekly Issue #442

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

Here’s a hands-on evaluation of the SLO offerings of three big players in the space. The author includes screenshots of their tests and shares their opinions on each.

  Alex Ewerlöf

🔥🔥🔥  Can calling yourself an SRE be a liability?

  rachelbythebay

This article outlines some options for combining multiple SLIs. I like the emphasis on ensuring that the result provides a useful overview without sacrificing too much.

  Ali Sattari
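
To make the trade-off concrete, here is a minimal sketch of two common ways to roll several SLIs into one number: a weighted average and a worst-of. This is not necessarily what the article proposes. The `SLI` class, its field names, the weights, and the sample numbers are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLI:
    """Hypothetical ratio-style SLI: good events over total valid events."""
    name: str
    good: int
    total: int
    weight: float = 1.0  # assumed relative importance, not from the article

    @property
    def ratio(self) -> float:
        # Assumption: an SLI with no traffic counts as fully healthy.
        return self.good / self.total if self.total else 1.0


def weighted_average(slis):
    """Smooth overview; a bad SLI can be masked by healthy ones."""
    total_weight = sum(s.weight for s in slis)
    if total_weight <= 0:
        raise ValueError("need at least one SLI with a positive weight")
    return sum(s.ratio * s.weight for s in slis) / total_weight


def worst_of(slis):
    """Pessimistic overview; the composite is only as good as the worst SLI."""
    return min(s.ratio for s in slis)


slis = [
    SLI("availability", good=9990, total=10000, weight=2.0),
    SLI("latency", good=9500, total=10000, weight=1.0),
]

print(f"weighted average: {weighted_average(slis):.4f}")
print(f"worst of:         {worst_of(slis):.4f}")
```

With these sample numbers the weighted average sits near 0.983 while the worst-of reports 0.95, which shows how averaging can hide a struggling SLI behind a healthy one.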

Lorin Hochstein proposes a rubric for judging whether a company truly is “safety first” in terms of preventing outages.

  Lorin Hochstein

In this blog, we’ll present four strategies for successfully managing reliability while adopting Kubernetes.

  Andre Newman — Gremlin

I haven’t seen a migration like this before. They managed a slow transition from an old system to a new one, keeping data in sync even though the two applications had entirely different database systems.

   Claudio Guidi and Giovanni Cuccu — DZone

[…] what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn’t need asynchronous IO in the first place?

  Yorick Peterse

I love a multi-level complex failure.

[…] during this disruption, a secondary issue caused automated failover to not work, rendering the entire metadata storage unavailable despite two other healthy zones being available.

  Google

Updated: September 15, 2024 — 9:29 pm