The Linux OOM killer can already be a bugbear, and things only get more complicated when you add containers to the mix.
Rafał Korepta — RedPanda
This post explores how to align platform and product engineering teams by implementing business value proxy metrics and using incidents to inform them.
The same metrics that we use to measure other initiatives against business priorities may be able to show us whether our incident response process is effective.
Gonzalo Maldonado — FireHydrant
Here’s another take on DevOps vs. SRE, using the metaphor of organizing a party.
How do you balance taking advantage of the acceleration and innovation of AI without compromising reliability and losing users?
Jim Gochee — The New Stack
My favorite part is the bit about the risks of automation and keeping humans in the loop.
Dr. Mica Endsley — Business News This Week
It’s about reliability: IaC changes carry just as much risk to reliability as product code changes, if not more. How can we bring feature flags to IaC?
Josephine E. Justin, Srikanth Murali, and Norton Stanley S A — DZone
Oh, the tangled web we weave when we send automated emails.
Amin Astaneh — Certo Modo
Here are four things we learned while scaling up Presto to Meta scale, and some advice if you’re interested in running your own queries at scale.