NATS is the nervous system of modern multi-agent platforms. At agent.ceo, every agent communicates over NATS JetStream: task assignments, status updates, security alerts, and inter-agent coordination all flow through NATS subjects. When our AI CSO agent performed its first automated security audit, four of the 14 HIGH-severity findings were NATS authentication issues: shared tokens across agents, missing TLS enforcement, absent credential rotation, and overly broad subject permissions.
This post details what we found, why it matters, and exactly how we hardened NATS authentication for a fleet of autonomous AI agents.
The Vulnerability: Shared NATS Credentials
In the initial deployment, agents shared a single NATS authentication token:
```
# VULNERABLE: one shared credential for every agent
# nats-server.conf (BEFORE)
authorization {
  token: "s3cr3t-shared-token-for-all-agents"
}
```
This pattern creates cascading risks:
- Lateral movement: If one agent is compromised, the attacker gains access to every NATS subject
- No attribution: Impossible to distinguish which agent published a message
- Rotation nightmare: Changing the token requires updating every agent simultaneously
- Blast radius: A single credential leak exposes the entire messaging layer
The Fix: Per-Agent Credential Isolation with NKey Authentication
We implemented NATS NKey-based authentication with per-agent keypairs and scoped permissions:
```
# nats-server.conf (AFTER - hardened)
authorization {
  # Each agent gets a dedicated user with scoped permissions.
  # The nkey values are illustrative placeholders: real user public keys
  # are 56-character strings, e.g. generated with `nk -gen user -pubout`.
  users = [
    {
      # CSO Agent - security scanning
      nkey: "UABRZ3OQWKJTFXNCOEH5Y2PDQVXP5KLNKZVO7VHOKQCFUQNHB3WGXYZ"
      permissions: {
        publish: {
          allow: ["security.findings.>", "security.remediate.>"]
          deny: ["agents.commands.>"]
        }
        subscribe: {
          allow: ["security.>", "agents.status.>", "config.changes.>"]
          # No direct reply subscriptions; since replies arrive on _INBOX
          # subjects, this also blocks request/reply for this agent
          deny: ["_INBOX.>"]
        }
      }
    },
    {
      # DevOps Agent - deployment operations
      nkey: "UBCDE4PQRSKLTMXNOGH6Z3QERVXQ6LMOLAZWP8VIHLRDGVRHC4XHABC"
      permissions: {
        publish: {
          allow: ["deploy.>", "agents.status.devops"]
          deny: ["security.>", "credentials.>"]
        }
        subscribe: {
          allow: ["deploy.>", "tasks.devops.>"]
        }
      }
    },
    {
      # Marketing Agent - content operations
      nkey: "UCFGH5QRSTULVNXOPH7A4RFSWYR7MNPMBAAXQ9WJIMSEEHWSI5YIDEF"
      permissions: {
        publish: {
          allow: ["content.>", "agents.status.marketing"]
          deny: ["security.>", "deploy.>", "credentials.>"]
        }
        subscribe: {
          allow: ["content.>", "tasks.marketing.>"]
        }
      }
    }
  ]
}
```
Each agent now operates with the principle of least privilege. The marketing agent cannot publish to security topics. The CSO agent cannot issue deployment commands. Compromise of any single agent limits blast radius to that agent's scoped permissions.
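Least-privilege claims like these are easy to check mechanically before a config change ships. The sketch below evaluates a concrete subject against one `publish` or `subscribe` block, assuming the standard NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens) and that a matching `deny` entry overrides any `allow`. The function names are illustrative, not part of a NATS library, and the sketch does not model subscriptions whose own subject contains wildcards.

```python
from typing import Optional, Sequence


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a concrete (wildcard-free) subject matches a NATS pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and needs at least one token left
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def is_permitted(
    subject: str,
    allow: Optional[Sequence[str]] = None,
    deny: Sequence[str] = (),
) -> bool:
    """Evaluate one permission block: deny wins, then allow (if an allow list is set)."""
    if any(subject_matches(p, subject) for p in deny):
        return False
    if allow is None:
        # No allow list means everything not denied is permitted
        return True
    return any(subject_matches(p, subject) for p in allow)
```

For example, with the marketing agent's publish block (`allow=["content.>", "agents.status.marketing"]`, `deny=["security.>", "deploy.>", "credentials.>"]`), `is_permitted("security.findings.high", ...)` returns `False` and `is_permitted("content.blog.draft", ...)` returns `True`.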
TLS Enforcement: Encrypting Agent-to-Agent Communication
Our CSO agent flagged plaintext NATS connections as HIGH severity. Here is the TLS configuration we implemented:
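```
# nats-server.conf - TLS configuration
tls {
  cert_file: "/etc/nats/certs/server-cert.pem"
  key_file:  "/etc/nats/certs/server-key.pem"
  ca_file:   "/etc/nats/certs/ca-cert.pem"
  # Require and verify client certificates (mutual TLS). `verify_and_map`
  # would instead map the certificate identity to a user entry, which the
  # nkey-based users above do not define.
  verify: true
  # Enforce TLS 1.3 minimum
  min_version: "1.3"
  # Strong cipher suites only. Go's crypto/tls does not allow TLS 1.3
  # suites to be configured, so with min_version 1.3 this list is
  # effectively informational.
  cipher_suites: [
    "TLS_AES_256_GCM_SHA384",
    "TLS_CHACHA20_POLY1305_SHA256"
  ]
  # Timeout for the TLS handshake, in seconds
  timeout: 5
}
```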
On the client side, each agent connects with its own certificate and its own NKey:
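```python
import ssl

import nats  # nats-py; NKey authentication also requires the `nkeys` package


async def on_nats_error(e: Exception) -> None:
    print(f"NATS error: {e}")


async def on_disconnect() -> None:
    print("NATS disconnected")


async def on_reconnect() -> None:
    print("NATS reconnected")


def load_nkey_seed(agent_name: str) -> str:
    """Hypothetical helper: nats-py's `nkeys_seed` expects the path to a seed file."""
    return f"/etc/nats/nkeys/{agent_name}.nk"


async def create_secure_nats_connection(agent_name: str):
    """Create a TLS-secured NATS connection with per-agent credentials."""
    ssl_context = ssl.create_default_context(
        purpose=ssl.Purpose.SERVER_AUTH,
        cafile="/etc/nats/certs/ca-cert.pem",
    )
    # Client certificate presented to the server for mutual TLS
    ssl_context.load_cert_chain(
        certfile=f"/etc/nats/certs/{agent_name}-cert.pem",
        keyfile=f"/etc/nats/certs/{agent_name}-key.pem",
    )
    ssl_context.minimum_version = ssl.TLSVersion.TLSv1_3

    # Agent-specific NKey seed for authentication
    nkey_seed_path = load_nkey_seed(agent_name)

    nc = await nats.connect(
        servers=["tls://nats.internal:4222"],
        tls=ssl_context,
        nkeys_seed=nkey_seed_path,
        name=agent_name,
        max_reconnect_attempts=10,
        reconnect_time_wait=2,
        error_cb=on_nats_error,
        disconnected_cb=on_disconnect,
        reconnected_cb=on_reconnect,
    )
    return nc
```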
Automated Credential Rotation
Static credentials, even when per-agent, eventually become a liability. We implemented automated rotation using a dedicated credential rotation service:
```python
import asyncio
from datetime import datetime, timedelta, timezone

# Hypothetical internal helper returning (seed, public_key) for a new user NKey.
from credential_service.nkeys import generate_user_nkey


class NATSCredentialRotator:
    """Automated NATS credential rotation for agent fleet.

    `nats_admin_client`, `agent_registry`, and the helper methods called on
    self (distribute_credential, wait_for_reconnection, audit_log,
    alert_security_team) are internal services, not part of nats-py.
    """

    ROTATION_INTERVAL = timedelta(hours=24)
    GRACE_PERIOD = timedelta(minutes=30)
    CHECK_INTERVAL_SECONDS = 300  # Check every 5 minutes

    def __init__(self, nats_admin_client, agent_registry):
        self.admin = nats_admin_client
        self.registry = agent_registry

    async def rotate_agent_credentials(self, agent_name: str):
        """Rotate NKey credentials for a specific agent."""
        # Generate new NKey pair
        new_seed, new_public_key = generate_user_nkey()

        # Phase 1: Add new key to server (old and new both valid).
        # With a static nats-server.conf this means writing the new user
        # entry and reloading the server configuration.
        await self.admin.add_agent_nkey(
            agent_name=agent_name,
            public_key=new_public_key,
            permissions=self.registry.get_permissions(agent_name),
        )

        # Phase 2: Distribute new seed to agent securely
        await self.distribute_credential(
            agent_name=agent_name,
            credential=new_seed,
            via="encrypted_k8s_secret",
        )

        # Phase 3: Wait for agent to reconnect with new credential
        confirmed = await self.wait_for_reconnection(
            agent_name=agent_name,
            new_key=new_public_key,
            timeout=self.GRACE_PERIOD,
        )

        if confirmed:
            # Phase 4: Revoke old key, then record the new one
            old_key = self.registry.get_current_key(agent_name)
            await self.admin.revoke_nkey(agent_name, old_key)
            self.registry.update_key(agent_name, new_public_key)
            await self.audit_log(
                event="credential_rotation_complete",
                agent=agent_name,
                timestamp=datetime.now(timezone.utc),
            )
        else:
            # Rollback: remove new key (old key stays valid), alert security team
            await self.admin.revoke_nkey(agent_name, new_public_key)
            await self.alert_security_team(
                f"Credential rotation failed for {agent_name}"
            )

    async def run_rotation_loop(self):
        """Continuously rotate credentials for all agents."""
        while True:
            agents = await self.registry.list_agents()
            now = datetime.now(timezone.utc)
            for agent in agents:
                # Assumes last_credential_rotation is a timezone-aware UTC datetime
                if now - agent.last_credential_rotation > self.ROTATION_INTERVAL:
                    try:
                        await self.rotate_agent_credentials(agent.name)
                    except Exception as exc:
                        # One failed rotation must not stop the loop for other agents
                        await self.alert_security_team(
                            f"Credential rotation error for {agent.name}: {exc}"
                        )
            await asyncio.sleep(self.CHECK_INTERVAL_SECONDS)
```
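Every public key the rotator adds to the server has to be a well-formed NKey. As a minimal sketch (function names are illustrative; in production the official `nkeys` libraries or the `nk` CLI should generate keys), this reproduces the user public-key encoding used by the reference `nkeys` Go library: a one-byte type prefix (`20 << 3`, which renders as a leading `U`), the 32-byte Ed25519 public key, and a little-endian CRC-16/XMODEM checksum, base32-encoded without padding. A validator like this can gate config reloads so that a truncated or mistyped key is rejected before it reaches the server.

```python
import base64
import binascii
import struct

PREFIX_BYTE_USER = 20 << 3  # Renders as a leading "U" once base32-encoded


def _crc16(data: bytes) -> int:
    # CRC-16/XMODEM: polynomial 0x1021, initial value 0
    return binascii.crc_hqx(data, 0)


def encode_user_public_key(raw_public_key: bytes) -> str:
    """Encode a raw 32-byte Ed25519 public key as a NATS user NKey."""
    if len(raw_public_key) != 32:
        raise ValueError("Ed25519 public keys are 32 bytes")
    body = bytes([PREFIX_BYTE_USER]) + raw_public_key
    checksum = struct.pack("<H", _crc16(body))  # little-endian
    # 35 bytes encode to exactly 56 base32 characters, so there is no padding
    return base64.b32encode(body + checksum).decode("ascii")


def is_valid_user_public_key(nkey: str) -> bool:
    """Check length, type prefix, and checksum of a user public NKey."""
    if len(nkey) != 56 or not nkey.startswith("U"):
        return False
    try:
        raw = base64.b32decode(nkey)
    except (binascii.Error, ValueError):
        return False
    body, checksum = raw[:-2], raw[-2:]
    if body[0] != PREFIX_BYTE_USER:
        return False
    return struct.unpack("<H", checksum)[0] == _crc16(body)
```

`encode_user_public_key` accepts the raw public-key bytes from any Ed25519 implementation, so the key generation itself can stay in whichever crypto library the rotation service already uses.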
Subject-Level Authorization Patterns
Beyond per-agent NKeys, we implemented hierarchical subject authorization that maps to our multi-tenant architecture, using NATS accounts to isolate tenants from one another:
```
# Account-level isolation for multi-tenant deployments
accounts {
  TENANT_A {
    users: [
      { nkey: "U...", permissions: { publish: "tenant_a.>" } }
    ]
    exports: [
      { stream: "tenant_a.events.public" }
      # Required for the PLATFORM import below
      { stream: "tenant_a.security.>" }
    ]
  }
  TENANT_B {
    users: [
      { nkey: "U...", permissions: { publish: "tenant_b.>" } }
    ]
    exports: [
      # Required for the PLATFORM import below
      { stream: "tenant_b.security.>" }
    ]
    imports: [
      { stream: { account: TENANT_A, subject: "tenant_a.events.public" } }
    ]
  }
  # Platform-level agents (CSO, DevOps) in separate account
  PLATFORM {
    users: [
      {
        nkey: "U..."
        permissions: {
          publish: { allow: "platform.>" }
          # Accounts are isolated: tenant subjects (security or status) reach
          # PLATFORM only through imports, such as the security streams below
          subscribe: { allow: ["*.security.>", "*.status.>"] }
        }
      }
    ]
    imports: [
      # Security agents can observe all tenant activity metadata
      { stream: { account: TENANT_A, subject: "tenant_a.security.>" } }
      { stream: { account: TENANT_B, subject: "tenant_b.security.>" } }
    ]
  }
}
```
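Imports only deliver messages when the source account exports a covering subject, and that pairing is easy to break as tenants are added. A minimal pre-deploy lint, assuming the account block has already been parsed into plain dictionaries (the dictionary shape and function names here are hypothetical, not the server's own format or API), reports every stream import that no export covers.

```python
from typing import Dict, List


def pattern_covers(export: str, imported: str) -> bool:
    """True if every subject matched by `imported` is also matched by `export`."""
    e_tokens = export.split(".")
    i_tokens = imported.split(".")
    for idx, e in enumerate(e_tokens):
        if e == ">":
            # '>' covers one or more remaining tokens, wildcards included
            return len(i_tokens) > idx
        if idx >= len(i_tokens) or i_tokens[idx] == ">":
            # Import is shorter, or open-ended where the export is not
            return False
        if e != "*" and e != i_tokens[idx]:
            return False
    return len(e_tokens) == len(i_tokens)


def find_uncovered_imports(accounts: Dict[str, dict]) -> List[str]:
    """List stream imports whose source account has no covering export."""
    problems = []
    for name, account in accounts.items():
        for imp in account.get("imports", []):
            source, subject = imp["account"], imp["subject"]
            exports = accounts.get(source, {}).get("exports", [])
            if not any(pattern_covers(exp, subject) for exp in exports):
                problems.append(f"{name}: {subject!r} is not exported by {source}")
    return problems
```

An import of `tenant_a.security.>` is covered by an export of `tenant_a.security.>` or `tenant_a.>`, but not by `tenant_a.events.public`.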
Monitoring and Alerting on Auth Failures
We instrument NATS to detect potential attacks in real time:
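```python
import json
from datetime import timedelta

# Assumptions: nats-server publishes auth-error advisories on
# $SYS.SERVER.<server_id>.CLIENT.AUTH.ERR (readable from a system-account
# connection); increment_failure_counter, block_source, alert, and
# SENSITIVE_SUBJECTS are internal helpers not shown here.


async def monitor_auth_failures(nats_admin):
    """Monitor for authentication anomalies indicating attack attempts."""

    async def on_auth_failure(msg):
        event = json.loads(msg.data)
        source_ip = event.get("client", {}).get("host", "unknown")
        failure_count = await increment_failure_counter(
            source_ip=source_ip, window=timedelta(minutes=5)
        )
        if failure_count > 10:
            # Potential brute force - block at network level
            await block_source(source_ip)
            await alert(
                severity="HIGH",
                message=f"NATS auth brute force from {source_ip}",
                action="auto_blocked",
            )

    # nats-py's second positional parameter is `queue`, so pass cb by keyword
    await nats_admin.subscribe("$SYS.SERVER.*.CLIENT.AUTH.ERR", cb=on_auth_failure)


async def on_permission_violation(subject: str, source_ip: str):
    """Auth errors carry no subject; this assumes a separate violation feed."""
    if subject in SENSITIVE_SUBJECTS:
        # Attempted access to security/credential topics
        await alert(
            severity="CRITICAL",
            message=f"Unauthorized access attempt to {subject}",
            source=source_ip,
        )
```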
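The brute-force rule depends on counting failures per source IP over a rolling five-minute window. `increment_failure_counter` is not shown in this post, so here is one possible in-memory sketch, assuming a single monitor process and timestamps that arrive in order; a fleet-wide counter would need shared storage. The class name is hypothetical, and timestamps can be passed in explicitly to make the behavior easy to test.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone
from typing import Deque, Dict, Optional


class SlidingWindowCounter:
    """Count events per key inside a rolling time window (single process only)."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._events: Dict[str, Deque[datetime]] = defaultdict(deque)

    def increment(self, key: str, now: Optional[datetime] = None) -> int:
        """Record one event for `key` and return how many fall inside the window."""
        if now is None:
            now = datetime.now(timezone.utc)
        events = self._events[key]
        events.append(now)
        # Evict events older than the window; the newest event always stays
        while now - events[0] > self.window:
            events.popleft()
        return len(events)
```

With the `> 10` threshold in the handler above, the eleventh failure from one address within five minutes is the first to trigger a block.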
Results After Hardening
After implementing these changes across our resilient agent fleet:
- Zero shared credentials: Each of 12+ agents has isolated NKey authentication
- 24-hour rotation: All credentials rotate automatically every 24 hours
- mTLS everywhere: No plaintext NATS traffic, TLS 1.3 enforced
- Blast radius contained: Compromising one agent grants access only to that agent's scoped subjects
- Full audit trail: Every authentication event logged for SOC 2 evidence
These patterns apply to any organization building multi-agent AI systems that rely on message-based coordination. The investment in proper NATS authentication pays dividends in security posture and compliance readiness.
For the broader credential management strategy across cloud providers, see our post on credential management for multi-cloud AI agents.
Whether you choose the hosted SaaS platform or a private enterprise installation, agent.ceo delivers the same autonomous workforce capabilities.
Try agent.ceo
SaaS — Get started with 1 free agent-week at agent.ceo.
Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.
agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo