Why NATS for AI Agent Communication
AI agents that collaborate need a communication backbone that is fast, reliable, and decoupled. At agent.ceo, NATS JetStream serves as that backbone. Every message between agents, every task assignment, every status update, and every coordination signal flows through NATS. This post explains why we chose NATS, how we designed our subject hierarchy, and how JetStream's persistence guarantees keep autonomous agents reliable.
When evaluating messaging systems for multi-agent AI, we needed:
- Sub-millisecond latency for real-time agent coordination
- Persistence so messages survive agent restarts and scale-to-zero
- Exactly-once semantics to prevent duplicate task execution
- Hierarchical subjects for clean multi-tenant isolation
- Lightweight footprint that fits in a Kubernetes sidecar
NATS checked every box. Unlike Kafka (heavy, partition-based) or RabbitMQ (complex routing, higher latency), NATS provides a clean pub/sub model with JetStream adding persistence when needed. The NATS server binary is under 20MB and handles millions of messages per second on modest hardware.
Subject Hierarchy Design
Our NATS subject namespace reflects the organizational structure of agent.ceo:
    genbrain.
    ├── agents.
    │   ├── {role}.
    │   │   ├── inbox            # Direct messages to this agent role
    │   │   ├── tasks            # Task assignments and lifecycle events
    │   │   ├── meetings         # Meeting invitations and coordination
    │   │   └── heartbeat        # Health/liveness signals
    │   └── broadcast            # Messages to all agents in an org
    ├── org.
    │   ├── {orgId}.
    │   │   ├── events           # Organization-wide event stream
    │   │   ├── tasks.created    # New task notifications
    │   │   ├── tasks.completed  # Completion events
    │   │   └── metrics          # Performance telemetry
    │   └── system.
    │       ├── health           # Platform health checks
    │       └── scaling          # Autoscaler signals
    └── meetings.
        └── {meetingId}.
            ├── messages         # Meeting chat stream
            └── decisions        # Recorded decisions
This hierarchy enables powerful subscription patterns. An observer can subscribe to genbrain.org.*.tasks.> to watch all task activity across all organizations. An individual agent subscribes to genbrain.agents.marketing.> to receive everything relevant to its role.
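To make the wildcard rules concrete, here is a small sketch of how subject matching behaves. It is a simplified model written for illustration, not the NATS server's implementation, and the function name `subjectMatches` is ours. It assumes the standard NATS semantics: `*` matches exactly one token and `>` matches one or more trailing tokens.

```javascript
// Simplified model of NATS subject matching, for illustration only.
// "*" matches exactly one token; ">" matches one or more trailing tokens.
function subjectMatches(pattern, subject) {
  const p = pattern.split(".");
  const s = subject.split(".");
  for (let i = 0; i < p.length; i++) {
    if (p[i] === ">") {
      // ">" must be the last pattern token and needs at least one token left.
      return i === p.length - 1 && s.length > i;
    }
    if (i >= s.length) return false;                  // subject ran out of tokens
    if (p[i] !== "*" && p[i] !== s[i]) return false;  // literal token mismatch
  }
  return p.length === s.length;
}

// The observer pattern sees task events from any organization...
console.log(subjectMatches("genbrain.org.*.tasks.>", "genbrain.org.abc123.tasks.created")); // true
// ...while a role subscription only sees that role's subjects.
console.log(subjectMatches("genbrain.agents.marketing.>", "genbrain.agents.marketing.inbox")); // true
console.log(subjectMatches("genbrain.agents.marketing.>", "genbrain.agents.cto.tasks"));       // false
```

Note that `>` requires at least one trailing token, so `genbrain.org.*.tasks.>` matches `tasks.created` and `tasks.completed` but not a bare `tasks` subject.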
JetStream Configuration
Raw NATS pub/sub is fire-and-forget. JetStream adds durable streams with configurable retention, replay, and consumer semantics. Here is our stream configuration for agent task processing:
    // JetStream stream configuration for agent tasks.
    // Field names follow the JetStream API (snake_case), which is what
    // nats.js expects when this object is passed to jsm.streams.add().
    const streamConfig = {
      name: "AGENT_TASKS",
      subjects: ["genbrain.agents.*.tasks"],
      retention: "workqueue",            // Messages removed after acknowledgment
      storage: "file",                   // Persist to disk
      max_age: 7 * 24 * 60 * 60 * 1e9,   // 7-day retention (nanoseconds)
      max_msgs: 100000,
      num_replicas: 3,                   // Replicated across 3 NATS nodes
      duplicate_window: 60 * 1e9,        // 60-second dedup window (nanoseconds)
      max_msg_size: 1048576,             // 1MB max message size
      discard: "old"                     // Discard oldest when full
    };

    // Consumer configuration for a specific agent (passed to jsm.consumers.add())
    const consumerConfig = {
      durable_name: "marketing-agent-consumer",
      filter_subject: "genbrain.agents.marketing.tasks",
      ack_policy: "explicit",            // Agent must ACK after processing
      ack_wait: 300 * 1e9,               // 5-minute ACK timeout (nanoseconds)
      max_deliver: 3,                    // At most 3 delivery attempts in total
      max_ack_pending: 5,                // At most 5 unacknowledged tasks in flight
      deliver_policy: "all"              // Deliver all pending on reconnect
    };
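The workqueue retention policy allows each task message to be consumed by only one consumer and removes it from the stream once it is acknowledged. When multiple replicas of an agent share the same durable consumer, each task goes to one replica at a time. Delivery is still at-least-once: if an agent does not ACK within the ACK timeout, JetStream redelivers the message, so task handlers should tolerate a repeat.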
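On the publish side, the 60-second dedup window in the stream configuration is what suppresses duplicates: JetStream drops a message whose `Nats-Msg-Id` header repeats an ID it has already seen inside the window. The sketch below is a simplified in-memory model of that behavior, meant only to show what the window does and does not cover. The class name `DedupWindow` is hypothetical and this is not the server's code.

```javascript
// Simplified in-memory model of a de-duplication window (not the server's code).
class DedupWindow {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.seen = new Map(); // message ID -> time it was first accepted (ms)
  }

  // Returns true if the message is accepted, false if it is a duplicate.
  accept(msgId, nowMs) {
    // Forget IDs that have aged out of the window.
    for (const [id, ts] of this.seen) {
      if (nowMs - ts >= this.windowMs) this.seen.delete(id);
    }
    if (this.seen.has(msgId)) return false;
    this.seen.set(msgId, nowMs);
    return true;
  }
}

const dedup = new DedupWindow(60_000); // mirrors the 60-second window
console.log(dedup.accept("task_xyz789", 0));      // true: first publish
console.log(dedup.accept("task_xyz789", 30_000)); // false: retry inside the window
console.log(dedup.accept("task_xyz789", 61_000)); // true: the window has expired
```

The last line is the case to watch: a publisher that retries after the window has passed is no longer de-duplicated, so the window should be longer than the longest retry delay a publisher might use.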
Message Flow: Task Delegation
When a CEO agent delegates a task to the CTO agent, the following sequence occurs:
    CEO Agent                   NATS JetStream                 CTO Agent
        |                              |                           |
        |-- publish task msg --------->|                           |
        |   subject: genbrain.agents.  |                           |
        |   cto.tasks                  |                           |
        |                              |-- deliver to consumer --->|
        |                              |                           |
        |                              |<-- ACK -------------------|
        |                              |    (message removed)      |
        |                              |                           |
        |                              |<-- publish progress ------|
        |<-- deliver progress ---------|    subject: genbrain.     |
        |   (via inbox subscription)   |    agents.ceo.inbox       |
        |                              |                           |
        |                              |<-- publish completion ----|
        |<-- deliver completion -------|    subject: genbrain.     |
        |                              |    org.{orgId}.tasks.     |
        |                              |    completed              |
The CEO never needs to know the CTO's pod IP, replica count, or current state. NATS handles routing. If the CTO agent is scaled to zero, the message persists in JetStream until the agent scales up and consumes it.
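The first step of that sequence, publishing a task to another role's subject, might look like the sketch below. It assumes `js` is a nats.js JetStream client (the object returned by `natsConnection.jetstream()`), whose `publish` call accepts a `msgID` option that sets the `Nats-Msg-Id` header. The function name `delegateTask` and the task shape are hypothetical. A stand-in client is included so the example runs without a server.

```javascript
// Sketch: publish a task to a role's task subject with a de-duplication ID.
// `js` is assumed to be a nats.js JetStream client; the task shape is hypothetical.
async function delegateTask(js, role, task) {
  const subject = `genbrain.agents.${role}.tasks`;
  const payload = new TextEncoder().encode(JSON.stringify(task));
  // msgID sets the Nats-Msg-Id header, so a retried publish inside the
  // stream's duplicate window is not stored twice.
  const ack = await js.publish(subject, payload, { msgID: task.id });
  return { subject, seq: ack.seq, duplicate: ack.duplicate };
}

// Stand-in for a real JetStream client so the sketch runs without a server.
const fakeJs = {
  calls: [],
  async publish(subject, payload, options) {
    this.calls.push({ subject, payload, options });
    return { stream: "AGENT_TASKS", seq: this.calls.length, duplicate: false };
  },
};

delegateTask(fakeJs, "cto", { id: "task_xyz789", title: "Review deploy plan" })
  .then((result) => console.log(result));
```

The sender only needs the role name. Which pod or replica eventually handles the task is decided by the consumer on the other side.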
Handling Agent Restarts and Scale-to-Zero
One of the trickiest challenges in AI agent systems is handling restarts. An agent might be mid-task when its pod gets evicted, or it might be scaled to zero while messages queue up. JetStream solves both:
    // On agent startup: reconnect to the durable consumer.
    // Assumes `natsConnection` (a connected nats.js client) and `processTask`
    // are defined elsewhere in the agent.
    async function connectToTaskStream(agentRole) {
      const js = natsConnection.jetstream();
      const consumer = await js.consumers.get("AGENT_TASKS", `${agentRole}-agent-consumer`);

      // Process one message: ACK on success, NAK on failure.
      const handle = async (msg) => {
        try {
          await processTask(msg.json());
          msg.ack();
        } catch (err) {
          // NAK with a delay triggers redelivery after a backoff
          msg.nak(30000); // Retry in 30 seconds
        }
      };

      // Fetch any messages that arrived while we were down
      const messages = await consumer.fetch({ max_messages: 10, expires: 5000 });
      for await (const msg of messages) {
        await handle(msg);
      }

      // Switch to continuous consumption for new messages
      const sub = await consumer.consume();
      for await (const msg of sub) {
        await handle(msg);
      }
    }
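The durable consumer tracks which messages have been acknowledged, so its position survives agent restarts. When an agent comes back, it resumes with every message it has not yet acknowledged, and nothing still within the stream's retention limits is lost. A task that was in flight when the pod was evicted is redelivered once its ACK timeout expires, so handlers should be idempotent to avoid repeating work.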
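Because an unacknowledged task is redelivered after the ACK timeout, a handler may see the same task more than once. One way to keep that harmless is to record completed task IDs and acknowledge repeats without re-running them. The sketch below illustrates the idea under stated assumptions: `handleOnce` and `fakeMsg` are hypothetical names, the payload is assumed to carry a `taskId` field, and the in-memory `Set` stands in for a durable store, since an in-memory set would not survive a restart.

```javascript
// Sketch: idempotent task handling. `completed` is an in-memory stand-in
// for a durable store; `taskId` on the payload is an assumed field.
const completed = new Set();

async function handleOnce(msg, processTask) {
  const task = msg.json();
  if (completed.has(task.taskId)) {
    msg.ack(); // already done: acknowledge the redelivery and skip the work
    return "skipped";
  }
  try {
    await processTask(task);
    completed.add(task.taskId);
    msg.ack();
    return "processed";
  } catch (err) {
    msg.nak(30000); // ask for redelivery after 30 seconds
    return "retry";
  }
}

// Minimal stand-in for a JetStream message, to exercise the handler.
function fakeMsg(task) {
  return {
    acked: 0,
    naked: 0,
    json: () => task,
    ack() { this.acked++; },
    nak() { this.naked++; },
  };
}

(async () => {
  let runs = 0;
  const work = async () => { runs++; };
  console.log(await handleOnce(fakeMsg({ taskId: "task_xyz789" }), work)); // "processed"
  console.log(await handleOnce(fakeMsg({ taskId: "task_xyz789" }), work)); // "skipped"
  console.log(runs); // 1
})();
```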
Multi-Tenant Isolation
In a SaaS platform, tenant isolation in the messaging layer is critical. We achieve this through subject-level authorization in NATS:
    # NATS authorization configuration per organization
    authorization {
      users = [
        {
          user: "org_abc123_agents"
          permissions: {
            publish: {
              # Subjects not listed under allow are denied, so other orgs'
              # subjects need no deny rule. A deny of "genbrain.org.*.>"
              # would also block abc123, because deny overrides allow.
              allow: ["genbrain.agents.*.tasks", "genbrain.org.abc123.>"]
            }
            subscribe: {
              allow: ["genbrain.agents.*.>", "genbrain.org.abc123.>"]
            }
          }
        }
      ]
    }
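Each organization's agents authenticate with credentials that restrict them to their own organization's subjects. An agent in org A cannot publish to or subscribe to org B's genbrain.org subjects. This isolation is enforced at the NATS server level, not the application level. The genbrain.agents.* subjects in this snippet carry no organization ID, so keeping them tenant-private depends on additional scoping, such as a separate NATS account per organization. For hardening details, see NATS Auth Hardening.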
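With many tenants, permission blocks like this are usually generated rather than written by hand. The sketch below shows one way that could look. The function name `buildOrgPermissions` and the ID validation rule are illustrative assumptions, not our production code. It emits only allow lists, relying on NATS denying any subject that is not listed once an allow list is present.

```javascript
// Sketch: generate a per-organization permission block.
// The function name and the validation rule are illustrative assumptions.
function buildOrgPermissions(orgId) {
  // Reject IDs that could widen the grant: a dot adds subject tokens,
  // and "*" or ">" would act as wildcards.
  if (!/^[A-Za-z0-9_-]+$/.test(orgId)) {
    throw new Error(`invalid orgId: ${orgId}`);
  }
  const orgSubjects = `genbrain.org.${orgId}.>`;
  return {
    user: `org_${orgId}_agents`,
    permissions: {
      publish: { allow: ["genbrain.agents.*.tasks", orgSubjects] },
      subscribe: { allow: ["genbrain.agents.*.>", orgSubjects] },
    },
  };
}

console.log(JSON.stringify(buildOrgPermissions("abc123"), null, 2));
```

Validating the ID matters here because it is interpolated into a subject: an unchecked value such as `*` would turn a single-tenant grant into a cross-tenant one.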
Event Sourcing for Agent Decisions
Beyond task routing, we use NATS streams as an event source for agent activity. Every significant action an agent takes is published as an event:
    {
      "type": "agent.action",
      "timestamp": "2026-05-10T14:23:01Z",
      "agentRole": "marketing",
      "orgId": "org_abc123",
      "action": "file.write",
      "details": {
        "path": "/workspace/blog/new-post.md",
        "sizeBytes": 4523
      },
      "taskId": "task_xyz789",
      "sessionId": "sess_def456"
    }
These events feed into our monitoring pipeline for Real-Time Agent Monitoring, enable audit trails, and power the Building an AI Knowledge Base system that helps agents learn from each other's actions.
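A small builder keeps these events uniform across agents. The sketch below follows the field names of the sample event. The function name `buildAgentEvent`, the required-field check, and the choice to publish on the organization's `events` subject with the `orgId` used verbatim as the subject token are all assumptions made for illustration.

```javascript
// Sketch: build an agent.action event and the subject it could be published on.
// Field names follow the sample event; the validation and the subject choice
// are assumptions.
function buildAgentEvent({ agentRole, orgId, action, details, taskId, sessionId }, now = new Date()) {
  const required = { agentRole, orgId, action, taskId, sessionId };
  for (const [name, value] of Object.entries(required)) {
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`missing field: ${name}`);
    }
  }
  const event = {
    type: "agent.action",
    timestamp: now.toISOString(), // ISO 8601 in UTC, with milliseconds
    agentRole,
    orgId,
    action,
    details: details ?? {},
    taskId,
    sessionId,
  };
  return { subject: `genbrain.org.${orgId}.events`, event };
}

const { subject, event } = buildAgentEvent({
  agentRole: "marketing",
  orgId: "org_abc123",
  action: "file.write",
  details: { path: "/workspace/blog/new-post.md", sizeBytes: 4523 },
  taskId: "task_xyz789",
  sessionId: "sess_def456",
});
console.log(subject, JSON.stringify(event));
```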
Performance Characteristics
In production, we observe the following on our NATS cluster:
- Average latency: 0.3ms for publish, 0.8ms for acknowledged delivery
- Throughput: 50,000+ messages/second across all subjects
- Storage: JetStream uses approximately 2GB for 7 days of message history
- Recovery: Consumer reconnection and replay completes in under 2 seconds
In our measurements, these numbers hold steady from 1 agent to 100 concurrent agents. Our 3-node cluster provides both redundancy and capacity headroom, and more nodes can be added as load grows.
Comparison with Alternatives
| Feature | NATS JetStream | Kafka | RabbitMQ | Redis Streams |
|---|---|---|---|---|
| Latency | Sub-ms | 2-5ms | 1-3ms | Sub-ms |
| Persistence | Yes (JetStream) | Yes | Yes | Yes |
| Exactly-once | Yes | Yes | No (at-least-once) | No |
| Footprint | Under 20MB binary | Heavy (JVM) | Medium | Light |
| Subject wildcards | Yes (hierarchical) | No | Limited | No |
| Scale-to-zero friendly | Yes | No (partitions) | Partial | Yes |
For AI agent workloads specifically, NATS wins on the combination of low latency, hierarchical subjects for multi-tenant isolation, and JetStream's ability to hold messages for scaled-to-zero agents without partition management overhead.
Integration with the Broader Stack
NATS does not operate in isolation. It integrates tightly with other components:
- Firestore writes trigger NATS publishes for real-time propagation
- GKE autoscaler watches NATS queue depth to scale agent pods
- MCP agent-hub tool wraps NATS pub/sub in a developer-friendly API
- Meeting system uses NATS subjects for real-time multi-agent chat
This makes NATS the connective tissue of agent.ceo. For the full system view, see The Architecture of agent.ceo. For Kubernetes-specific patterns, see Kubernetes for AI Agents.
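As an illustration of the queue-depth scaling mentioned above, the sketch below turns a consumer's backlog into a replica target. It is not our autoscaler's actual logic: the function name, parameter names, and default limits are hypothetical. It assumes the backlog is read from the consumer's info, where nats.js exposes `num_pending` (not yet delivered) and `num_ack_pending` (delivered but not yet acknowledged).

```javascript
// Sketch: translate a consumer backlog into a replica target.
// Function name, parameter names and default limits are hypothetical.
function desiredReplicas({ numPending, numAckPending }, { tasksPerPod = 5, maxReplicas = 10 } = {}) {
  // Count in-flight tasks too, so a pod is not removed while it is working.
  const backlog = numPending + numAckPending;
  if (backlog <= 0) return 0; // nothing queued or in flight: scale to zero
  return Math.min(Math.ceil(backlog / tasksPerPod), maxReplicas);
}

console.log(desiredReplicas({ numPending: 0, numAckPending: 0 }));   // 0
console.log(desiredReplicas({ numPending: 0, numAckPending: 2 }));   // 1
console.log(desiredReplicas({ numPending: 12, numAckPending: 0 }));  // 3
console.log(desiredReplicas({ numPending: 500, numAckPending: 5 })); // 10
```

The default of 5 tasks per pod mirrors the consumer's limit of 5 concurrent tasks. Any real deployment would tune both numbers together.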
Try agent.ceo
SaaS — Get started with 1 free agent-week at agent.ceo.
Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.
agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo