I’m working on launching a new sibling project to SRE Weekly that will have a different format. I’m on the lookout for potential sponsors now, so if you’re interested, reply by email or drop me a note at lex at sreweekly dot com. And don’t worry! SRE Weekly itself is here to stay.
Thinking of creating a microservice architecture? Maybe think twice, says this article — backed by solid arguments.
Thiago Caserta
Octopus describes how their cell-based architecture is built for reliability, but it comes with a couple of trade-offs.
Pawel Pabich — Octopus Deploy
In this blog post, we’ll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues.
Jose Fernandez, Sebastien Dabdoub, Jason Koch, Artem Tkachuk — Netflix
Some great insights in this one, including these gems:
Myth #1: Redundancy Equals Reliability
Myth #2: Preventing Failure is the Only Goal
Myth #3: More Responders Equals Faster Resolution
Paula Thrasher — PagerDuty
These folks learned the hard way that Node doesn’t implement Happy Eyeballs. Definitely worth a read if you use Node or if you aren’t familiar with Happy Eyeballs.
Umut Uzgur and Nočnica Mellifera — Checkly
In this post, we’ll cover the basics of on-call scheduling, the different types of on-call schedules you can use and when each is most appropriate, best practices for managing on-call shifts, and all the mistakes people normally make along the way.
Chris Evans — incident.io
There’s a subtle distinction between heterogeneous and homogeneous SLIs, but it’s important to understand which kind you’re working with and the pros and cons of each.
Alex Ewerlöf
Due to a subtle bug in their automation, Cloudflare inadvertently revoked their advertisement for some IPv4 addresses that were still being used for customer traffic.