SRE Weekly Issue #442

Here’s a hands-on evaluation of the SLO offerings of three big players in the space. The author includes screenshots of their tests and shares their opinions on each.

Alex Ewerlöf

“SRE” doesn’t seem to mean anything useful any more

🔥🔥🔥 Can calling yourself an SRE be a liability?

rachelbythebay

Aggregating SLIs

This article outlines some options for combining multiple SLIs. I like the emphasis on ensuring that the result provides a useful overview without sacrificing too much.

Ali Sattari

Safety first!

Lorin Hochstein proposes a rubric for judging whether a company truly is “safety first” in terms of preventing outages.

Lorin Hochstein

Reliability recommendations when adopting Kubernetes

In this blog, we’ll present four strategies for successfully managing reliability while adopting Kubernetes.

Andre Newman — Gremlin

Two Multi-Master DBs Aligned With a Vector Clock

I haven’t seen a migration like this before. They managed a slow transition from an old system to a new one, keeping data in sync even though the two applications had entirely different database systems.

Claudio Guidi and Giovanni Cuccu — DZone

Asynchronous IO: the next billion-dollar mistake?

[…] what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn’t need asynchronous IO in the first place?

Yorick Peterse

Google Cloud Incident Report: September 7 incident in asia-northeast1

I love a multi-level complex failure.

[…] during this disruption, a secondary issue caused automated failover to not work, rendering the entire metadata storage unavailable despite two other healthy zones being available.

Google
