SRE Weekly Issue #443

I’m working on launching a new sibling project to SRE Weekly that will have a different format. I’m on the lookout for potential sponsors now, so if you’re interested, reply by email or drop me a note at lex at sreweekly dot com. And don’t worry! SRE Weekly itself is here to stay.

A message from our sponsor, FireHydrant:

FireHydrant has acquired Blameless! The addition of Blameless’ enterprise capabilities combined with FireHydrant’s platform creates the most comprehensive enterprise incident management solution in the market.

https://firehydrant.com/blog/press-release-firehydrant-acquires-blameless-to-further-solidify-enterprise/

Thinking of creating a microservice architecture? Maybe think twice, says this article — backed by solid arguments.

  Thiago Caserta

Octopus describes how their cell-based architecture is built for reliability, but it comes with a couple of trade-offs.

  Pawel Pabich — Octopus Deploy

In this blog post, we’ll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues.

  Jose Fernandez, Sebastien Dabdoub, Jason Koch, Artem Tkachuk — Netflix

Some great insights in this one, including these gems:

Myth #1: Redundancy Equals Reliability
Myth #2: Preventing Failure is the Only Goal
Myth #3: More Responders Equals Faster Resolution

  Paula Thrasher — PagerDuty

These folks learned the hard way that Node doesn’t implement Happy Eyeballs. Definitely worth a read if you use Node or if you aren’t familiar with Happy Eyeballs.

  Umut Uzgur and Nočnica Mellifera — Checkly
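
For readers who aren't familiar with Happy Eyeballs, here is a rough sketch of its core idea as described in RFC 8305: interleave the resolved addresses by family and start connection attempts a short delay apart, rather than waiting for each one to time out in turn. This is only an illustration, not code from the linked article or from Node. The function names and sample addresses are hypothetical, and the 250 ms delay is the RFC's recommended default. A real client would race the actual connections and cancel the rest once one succeeds.

```python
import socket
from itertools import zip_longest


def interleave_by_family(addrs, first_family=socket.AF_INET6):
    """Alternate address families, starting with the preferred one.

    addrs is a list of (family, address) tuples, e.g. from DNS resolution.
    """
    preferred = [a for a in addrs if a[0] == first_family]
    others = [a for a in addrs if a[0] != first_family]
    ordered = []
    for p, o in zip_longest(preferred, others):
        if p is not None:
            ordered.append(p)
        if o is not None:
            ordered.append(o)
    return ordered


def attempt_schedule(addrs, delay=0.25):
    """Return (start_offset_seconds, family, address) for each staggered attempt.

    This only computes when each attempt would start; it opens no sockets.
    """
    ordered = interleave_by_family(addrs)
    return [(i * delay, fam, addr) for i, (fam, addr) in enumerate(ordered)]


if __name__ == "__main__":
    # Documentation-range addresses (RFC 3849 / RFC 5737), used as placeholders.
    resolved = [
        (socket.AF_INET6, "2001:db8::1"),
        (socket.AF_INET6, "2001:db8::2"),
        (socket.AF_INET, "192.0.2.1"),
    ]
    for offset, family, address in attempt_schedule(resolved):
        print(f"t+{offset:.2f}s  {family.name:8}  {address}")
```

The point of the staggered schedule is that a broken IPv6 path costs a fraction of a second before the IPv4 attempt starts, not a full connection timeout.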

In this post, we’ll cover the basics of on-call scheduling, the different types of on-call schedules you can use and when each is most appropriate, best practices for managing on-call shifts, and all the mistakes people normally make along the way.

  Chris Evans — incident.io

There’s a subtle distinction between heterogeneous and homogeneous SLIs, but it’s important to understand which kind you’re working with and the pros and cons of each.

  Alex Ewerlöf

Due to a subtle bug in their automation, Cloudflare inadvertently revoked their advertisement for some IPv4 addresses that were still being used for customer traffic.

Updated: September 22, 2024 — 9:54 pm
A production of Tinker Tinker Tinker, LLC