SRE Weekly Issue #514

How we built a real-world evaluation platform for autonomous SRE agents at scale

Finally! Someone actually explaining how they test their SRE agent. Having a testing methodology is table stakes. Showing their work helps us decide whether we can trust the tool.

With so many SRE agents floating around, it’s quite surprising to me that this kind of article is so rare.

Benjamin Barton — Datadog

Behind the scenes: How Database Traffic Control works

An enlightening deep dive into the way this Postgres resource management system evaluates the cost of queries in order to shed resource-intensive ones.

Patrick Reynolds — PlanetScale

Why Security Incidents Feel Different from Outages

If you’ve ever been in an incident where communication suddenly went quiet and access got restricted, this article explains why. The author breaks down five fundamental ways security incident response diverges from outage response — and why the instincts that make you effective at one can actively work against you in the other.

Art Kondratiev — Uptime Labs

Reliability Is Security: Why SRE Teams Are Becoming the Frontline of Cloud Defense

Security and reliability are inextricably intertwined. Examples: reliability failures leave security temporarily weak and vulnerable, and security changes have caused a number of recent high-profile outages.

Oreoluwa Omoike — DZone

Kubernetes Autoscaling: What Breaks Under Real Traffic

Some timely reminders about the realities of how autoscaling actually works in Kubernetes. It’s all about tuning your mental model.

Ankush Madaan — DZone

The Myth of Horizontal Scalability

There’s a limit to how far parallelism can get you, and it’s down to what part of your workload is by necessity serial.

[…] in practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story.

David Iyanu Jonathan — DZone
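
The serial-fraction limit described in this item is commonly formalized as Amdahl's law. The sketch below is my own illustration, not code from the linked article; the function name and the 5% serial fraction are assumptions chosen only to show the shape of the curve.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law.

    serial_fraction: share of the work that cannot be parallelized (0.0 to 1.0).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # The serial part takes the same time regardless of worker count;
    # only the parallel part shrinks as workers are added.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Hypothetical workload where 5% of each request is serial
    # (for example, a write to a shared database).
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.05, n), 2))
```

With a 5% serial share the speedup can never reach 1 / 0.05 = 20, however many workers are added, which is the ceiling the blurb is pointing at.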

How Our gRPC Services Collapsed During Traffic Bursts — and What Finally Stopped It

This is a great story, and I really liked the section on why traditional reliability techniques (autoscaling, circuit breakers, and rate limits) weren’t enough.

Parveen Saini — DZone
