Finally! Someone actually explaining how they test their SRE agent. Having a testing methodology is table stakes. Showing their work helps us decide whether we can trust the tool.
With so many SRE agents floating around, it’s quite surprising to me that this kind of article is so rare.
Benjamin Barton — Datadog
An enlightening deep dive into how this Postgres resource management system estimates the cost of queries in order to shed resource-intensive ones.
Patrick Reynolds — PlanetScale
If you’ve ever been in an incident where communication suddenly went quiet and access got restricted, this article explains why. The author breaks down five fundamental ways security incident response diverges from outage response — and why the instincts that make you effective at one can actively work against you in the other.
Art Kondratiev — Uptime Labs
Security and reliability are inextricably intertwined. Examples: reliability failures leave security temporarily weak and vulnerable, and security changes have caused a number of recent high-profile outages.
Oreoluwa Omoike — DZone
Some timely reminders about the realities of how autoscaling actually works in Kubernetes. It’s all about tuning your mental model.
Ankush Madaan — DZone
There’s a limit to how far parallelism can get you, and it’s down to what part of your workload is by necessity serial.
[…] in practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story.
David Iyanu Jonathan — DZone
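The point about serial work capping parallel gains is commonly formalized as Amdahl's law: with a serial fraction s and n workers, the speedup is at most 1 / (s + (1 - s) / n), which approaches 1 / s as n grows. The linked article may frame it differently; the sketch below is only an illustration, and the function names and the 10% serial fraction are assumptions chosen for the example.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup under Amdahl's law.

    serial_fraction: share of the workload that cannot be parallelized (0..1).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(serial_fraction: float) -> float:
    """Limit of the speedup as the number of workers grows without bound."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical workload: 10% of the work is serial by necessity.
    for n in (1, 2, 10, 100, 1000):
        print(f"{n:>5} workers -> {amdahl_speedup(0.10, n):.2f}x")
    print(f"ceiling     -> {max_speedup(0.10):.2f}x")
```

With a 10% serial share, adding workers past a few dozen buys very little: the ceiling is 10x no matter how many are added.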
This is a great story, and I really liked the section on why traditional reliability techniques (autoscaling, circuit breakers, and rate limits) weren’t enough.
Parveen Saini — DZone
