Technical, 6 min read

NATS Authentication Hardening for Multi-Agent Systems

Moshe Beeri, Founder
Tags: nats, authentication, tls, credential-rotation, multi-agent, security

NATS is the nervous system of modern multi-agent platforms. At agent.ceo, every agent communicates over NATS JetStream: task assignments, status updates, security alerts, and inter-agent coordination all flow through NATS subjects. When our AI CSO agent performed its first automated security audit, four of the 14 HIGH-severity findings were NATS authentication issues: tokens shared across agents, missing TLS enforcement, no credential rotation, and overly broad subject permissions.

This post details what we found, why it matters, and exactly how we hardened NATS authentication for a fleet of autonomous AI agents.

The Vulnerability: Shared NATS Credentials

In the initial deployment, agents shared a single NATS authentication token:

# VULNERABLE: Shared credential across all agents
# nats-server.conf (BEFORE)
authorization {
  token: "s3cr3t-shared-token-for-all-agents"
}

This pattern creates cascading risks:

  1. Lateral movement: If one agent is compromised, the attacker gains access to every NATS subject
  2. No attribution: It is impossible to tell which agent published a given message
  3. Rotation nightmare: Changing the token requires a simultaneous update across all agents
  4. Blast radius: A single credential leak exposes the entire messaging layer

The Fix: Per-Agent Token Isolation with NKey Authentication

We implemented NATS NKey-based authentication with per-agent keypairs and scoped permissions:

# nats-server.conf (AFTER - hardened)
authorization {
  # Each agent gets a dedicated user with scoped permissions
  # (the public NKeys shown here are illustrative placeholders, not usable keys)
  users = [
    {
      # CSO Agent - security scanning
      nkey: "UABRZ3OQWKJTFXNCOEH5Y2PDQVXP5KLNKZVO7VHOKQCFUQNHB3WGXYZ"
      permissions: {
        publish: {
          allow: ["security.findings.>", "security.remediate.>"]
          deny: ["agents.commands.>"]
        }
        subscribe: {
          allow: ["security.>", "agents.status.>", "config.changes.>"]
          deny: ["_INBOX.>"]  # No direct reply subscriptions
        }
      }
    },
    {
      # DevOps Agent - deployment operations
      nkey: "UBCDE4PQRSKLTMXNOGH6Z3QERVXQ6LMOLAZWP8VIHLRDGVRHC4XHABC"
      permissions: {
        publish: {
          allow: ["deploy.>", "agents.status.devops"]
          deny: ["security.>", "credentials.>"]
        }
        subscribe: {
          allow: ["deploy.>", "tasks.devops.>"]
        }
      }
    },
    {
      # Marketing Agent - content operations
      nkey: "UCFGH5QRSTULVNXOPH7A4RFSWYR7MNPMBAAXQ9WJIMSEEHWSI5YIDEF"
      permissions: {
        publish: {
          allow: ["content.>", "agents.status.marketing"]
          deny: ["security.>", "deploy.>", "credentials.>"]
        }
        subscribe: {
          allow: ["content.>", "tasks.marketing.>"]
        }
      }
    }
  ]
}

Each agent now operates under the principle of least privilege. The marketing agent cannot publish to security topics. The CSO agent cannot issue deployment commands. Because an allow list rejects every subject it does not name, compromising any single agent limits the blast radius to that agent's scoped permissions.
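As a rough illustration of how these allow and deny lists resolve, the sketch below re-implements NATS-style subject matching in plain Python. The helpers `subject_matches` and `is_allowed` are hypothetical names introduced here, not part of any NATS client; the server performs the real check, so treat this as a way to reason about a permission set before deploying it.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a subject matches a pattern using `*` and `>` wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # `>` is only valid as the last token and needs at least one more token
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def is_allowed(subject: str, allow: list[str], deny: list[str]) -> bool:
    """Deny takes precedence; with an allow list, unlisted subjects are rejected."""
    if any(subject_matches(d, subject) for d in deny):
        return False
    return any(subject_matches(a, subject) for a in allow)


# Publish permissions copied from the marketing agent's entry in the config
MARKETING_PUBLISH = {
    "allow": ["content.>", "agents.status.marketing"],
    "deny": ["security.>", "deploy.>", "credentials.>"],
}

print(is_allowed("content.blog.draft", **MARKETING_PUBLISH))
print(is_allowed("security.findings.new", **MARKETING_PUBLISH))
```

Running a handful of subjects through a checker like this makes it easier to spot an allow pattern that is broader than intended.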

TLS Enforcement: Encrypting Agent-to-Agent Communication

Our CSO agent flagged plaintext NATS connections as HIGH severity. Here is the TLS configuration we implemented:

# nats-server.conf - TLS configuration
tls {
  cert_file: "/etc/nats/certs/server-cert.pem"
  key_file: "/etc/nats/certs/server-key.pem"
  ca_file: "/etc/nats/certs/ca-cert.pem"

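  # Require and verify client certificates (mutual TLS).
  # verify_and_map also maps the certificate identity to a configured user;
  # use `verify: true` if you only need the certificate check.
  verify_and_map: true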

  # Enforce TLS 1.3 minimum
  min_version: "1.3"

  # Strong cipher suites only
  cipher_suites: [
    "TLS_AES_256_GCM_SHA384",
    "TLS_CHACHA20_POLY1305_SHA256"
  ]

  # Connection timeout for TLS handshake
  timeout: 5
}

On the client side, each agent connects with its own certificate:

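import ssl

import nats

# Hypothetical seed location; point this at wherever your secrets are mounted
NKEY_SEED_DIR = "/etc/nats/nkeys"


async def on_nats_error(exc):
    print(f"NATS error: {exc}")


async def on_disconnect():
    print("NATS connection lost")


async def on_reconnect():
    print("NATS connection restored")


async def create_secure_nats_connection(agent_name: str):
    """Create a TLS-secured NATS connection with per-agent credentials."""

    # Verify the server against our CA and present this agent's client certificate
    ssl_context = ssl.create_default_context(
        purpose=ssl.Purpose.SERVER_AUTH,
        cafile="/etc/nats/certs/ca-cert.pem"
    )
    ssl_context.load_cert_chain(
        certfile=f"/etc/nats/certs/{agent_name}-cert.pem",
        keyfile=f"/etc/nats/certs/{agent_name}-key.pem"
    )
    ssl_context.minimum_version = ssl.TLSVersion.TLSv1_3

    # nats-py's nkeys_seed option takes the path to the seed file, not its contents
    # (NKey authentication also requires the separate `nkeys` package)
    nkey_seed_path = f"{NKEY_SEED_DIR}/{agent_name}.nk"

    nc = await nats.connect(
        servers=["tls://nats.internal:4222"],
        tls=ssl_context,
        nkeys_seed=nkey_seed_path,
        name=agent_name,
        max_reconnect_attempts=10,
        reconnect_time_wait=2,
        error_cb=on_nats_error,
        disconnected_cb=on_disconnect,
        reconnected_cb=on_reconnect
    )

    return nc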
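Before handing an SSL context to the client, it can help to assert that it really is as strict as intended. The following is a minimal sketch using only the standard library; `check_client_tls_context` is a hypothetical helper introduced here, and it checks only the three properties listed, not whether a client certificate has been loaded.

```python
import ssl


def check_client_tls_context(ctx: ssl.SSLContext) -> list[str]:
    """Return a list of problems; an empty list means the context looks hardened."""
    problems = []
    if ctx.minimum_version < ssl.TLSVersion.TLSv1_3:
        problems.append("minimum TLS version is below 1.3")
    if ctx.verify_mode != ssl.CERT_REQUIRED:
        problems.append("server certificate verification is not required")
    if not ctx.check_hostname:
        problems.append("hostname checking is disabled")
    return problems


# A context configured the way the connection code configures it
hardened = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH)
hardened.minimum_version = ssl.TLSVersion.TLSv1_3

# A deliberately weakened context, for comparison
weak = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH)
weak.check_hostname = False
weak.verify_mode = ssl.CERT_NONE

print(check_client_tls_context(hardened))
print(check_client_tls_context(weak))
```

A check like this fits naturally into agent startup, so that a misconfigured context fails fast instead of silently downgrading.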

Automated Credential Rotation

Static credentials, even when per-agent, eventually become a liability. We implemented automated rotation using a dedicated credential rotation service:

import asyncio
import os
from datetime import datetime, timedelta

import nkeys  # standalone nkeys package (pip install nkeys)


def create_user_keypair():
    """Generate a fresh user NKey pair from 32 random bytes.

    Assumes the nkeys package exposes encode_seed and PREFIX_BYTE_USER;
    the `nk -gen user` CLI can produce the seed instead.
    """
    seed = nkeys.encode_seed(os.urandom(32), nkeys.PREFIX_BYTE_USER)
    return nkeys.from_seed(seed)


class NATSCredentialRotator:
    """Automated NATS credential rotation for agent fleet."""

    ROTATION_INTERVAL = timedelta(hours=24)
    GRACE_PERIOD = timedelta(minutes=30)

    def __init__(self, nats_admin_client, agent_registry):
        self.admin = nats_admin_client
        self.registry = agent_registry

    async def rotate_agent_credentials(self, agent_name: str):
        """Rotate NKey credentials for a specific agent."""

        # Generate new NKey pair
        new_keypair = create_user_keypair()
        new_public_key = new_keypair.public_key
        new_seed = new_keypair.seed

        # Phase 1: Add new key to server (both old and new valid)
        await self.admin.add_agent_nkey(
            agent_name=agent_name,
            public_key=new_public_key,
            permissions=self.registry.get_permissions(agent_name)
        )

        # Phase 2: Distribute new seed to agent securely
        await self.distribute_credential(
            agent_name=agent_name,
            credential=new_seed,
            via="encrypted_k8s_secret"
        )

        # Phase 3: Wait for agent to reconnect with new credential
        confirmed = await self.wait_for_reconnection(
            agent_name=agent_name,
            new_key=new_public_key,
            timeout=self.GRACE_PERIOD
        )

        if confirmed:
            # Phase 4: Revoke old key
            old_key = self.registry.get_current_key(agent_name)
            await self.admin.revoke_nkey(agent_name, old_key)

            # Update registry
            self.registry.update_key(agent_name, new_public_key)

            await self.audit_log(
                event="credential_rotation_complete",
                agent=agent_name,
                timestamp=datetime.utcnow()
            )
        else:
            # Rollback: remove new key, alert security team
            await self.admin.revoke_nkey(agent_name, new_public_key)
            await self.alert_security_team(
                f"Credential rotation failed for {agent_name}"
            )

    async def run_rotation_loop(self):
        """Continuously rotate credentials for all agents."""
        while True:
            agents = await self.registry.list_agents()
            for agent in agents:
                last_rotation = agent.last_credential_rotation
                if datetime.utcnow() - last_rotation > self.ROTATION_INTERVAL:
                    await self.rotate_agent_credentials(agent.name)

            await asyncio.sleep(300)  # Check every 5 minutes
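The overlap-then-revoke sequence is easier to reason about when run against an in-memory stand-in. The sketch below is an assumption-laden simulation, not the production rotator: `InMemoryKeyStore` and `rotate` are hypothetical names, and the `confirm` coroutine stands in for waiting until the agent reconnects with its new key.

```python
import asyncio


class InMemoryKeyStore:
    """Hypothetical stand-in for the server-side set of accepted public keys."""

    def __init__(self):
        self.valid: dict[str, set[str]] = {}

    def add(self, agent: str, key: str) -> None:
        self.valid.setdefault(agent, set()).add(key)

    def revoke(self, agent: str, key: str) -> None:
        self.valid.get(agent, set()).discard(key)


async def rotate(store, agent, old_key, new_key, confirm, timeout):
    """Add the new key, wait for confirmation, then revoke exactly one key."""
    store.add(agent, new_key)  # overlap window: both keys are accepted
    try:
        await asyncio.wait_for(confirm(), timeout)
    except asyncio.TimeoutError:
        store.revoke(agent, new_key)  # rollback: the old key keeps working
        return old_key
    store.revoke(agent, old_key)  # commit: only the new key remains
    return new_key


async def demo():
    store = InMemoryKeyStore()
    store.add("devops", "OLD")

    async def reconnects_quickly():
        await asyncio.sleep(0)

    async def never_reconnects():
        await asyncio.sleep(3600)

    current = await rotate(store, "devops", "OLD", "NEW", reconnects_quickly, 1)
    ok = (current, set(store.valid["devops"]))
    current = await rotate(store, "devops", "NEW", "NEWER", never_reconnects, 0.01)
    failed = (current, set(store.valid["devops"]))
    return ok, failed


ok, failed = asyncio.run(demo())
print(ok, failed)
```

The property worth preserving in any real implementation is that the agent always holds at least one accepted key: a failed rotation leaves the previous credential untouched.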

Subject-Level Authorization Patterns

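Beyond per-agent credentials, we use account-level isolation with hierarchical subjects that map to our multi-tenant architecture. Accounts have separate subject namespaces, so anything shared across them must be explicitly exported by one account and imported by another: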

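# Account-level isolation for multi-tenant deployments
accounts {
  TENANT_A {
    users: [
      { nkey: "U...", permissions: { publish: "tenant_a.>" } }
    ]
    exports: [
      { stream: "tenant_a.events.public" }
      # Private exports: only the PLATFORM account may import these
      { stream: "tenant_a.security.>", accounts: [PLATFORM] }
      { stream: "tenant_a.status.>", accounts: [PLATFORM] }
    ]
  }

  TENANT_B {
    users: [
      { nkey: "U...", permissions: { publish: "tenant_b.>" } }
    ]
    exports: [
      { stream: "tenant_b.security.>", accounts: [PLATFORM] }
      { stream: "tenant_b.status.>", accounts: [PLATFORM] }
    ]
    imports: [
      { stream: { account: TENANT_A, subject: "tenant_a.events.public" } }
    ]
  }

  # Platform-level agents (CSO, DevOps) in separate account
  PLATFORM {
    users: [
      {
        nkey: "U..."
        permissions: {
          publish: { allow: "platform.>" }
          # Cross-tenant status and security visibility (via the imports below)
          subscribe: { allow: ["*.status.>", "*.security.>"] }
        }
      }
    ]
    imports: [
      # Security agents can observe all tenant activity metadata
      { stream: { account: TENANT_A, subject: "tenant_a.security.>" } }
      { stream: { account: TENANT_B, subject: "tenant_b.security.>" } }
      { stream: { account: TENANT_A, subject: "tenant_a.status.>" } }
      { stream: { account: TENANT_B, subject: "tenant_b.status.>" } }
    ]
  }
}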
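Import and export lists drift apart easily as tenants are added, and an import with no covering export will not deliver anything. As a minimal sketch, the check below walks a dictionary description of the accounts and reports imports that no export covers. The dictionary layout and the helpers `export_covers` and `unmatched_imports` are assumptions made for illustration; this is not a parser for the server's config format.

```python
def export_covers(export: str, imported: str) -> bool:
    """True if an exported subject pattern covers the imported subject pattern."""
    e_tokens, i_tokens = export.split("."), imported.split(".")
    for pos, tok in enumerate(e_tokens):
        if tok == ">":
            return len(i_tokens) > pos
        if pos >= len(i_tokens) or i_tokens[pos] == ">":
            return False
        if tok != "*" and tok != i_tokens[pos]:
            return False
    return len(e_tokens) == len(i_tokens)


def unmatched_imports(accounts: dict) -> list[tuple[str, str, str]]:
    """List (importer, exporter, subject) for each import lacking a covering export."""
    missing = []
    for name, account in accounts.items():
        for source, subject in account.get("imports", []):
            exports = accounts.get(source, {}).get("exports", [])
            if not any(export_covers(e, subject) for e in exports):
                missing.append((name, source, subject))
    return missing


# Hypothetical layout in which one export has been forgotten
ACCOUNTS = {
    "TENANT_A": {"exports": ["tenant_a.events.public"]},
    "TENANT_B": {"imports": [("TENANT_A", "tenant_a.events.public")]},
    "PLATFORM": {"imports": [("TENANT_A", "tenant_a.security.>")]},
}

print(unmatched_imports(ACCOUNTS))
```

Running a check like this in CI, against whatever source generates the account config, catches the mismatch before a server reload does.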

Monitoring and Alerting on Auth Failures

We instrument NATS to detect potential attacks in real time:

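import json
from datetime import timedelta

# Server advisory published when a client fails authentication.
# Subscribing requires a connection bound to the system account.
AUTH_ERROR_SUBJECT = "$SYS.SERVER.*.CLIENT.AUTH.ERR"

# increment_failure_counter, block_source, alert, and SENSITIVE_SUBJECTS
# are application-provided helpers defined elsewhere.


async def monitor_auth_failures(nats_admin):
    """Monitor for authentication anomalies indicating attack attempts."""

    async def on_auth_failure(msg):
        event = json.loads(msg.data)
        # Field names follow the server advisory payload; verify for your version
        source_ip = event.get("client", {}).get("host", "unknown")

        failure_count = await increment_failure_counter(
            source_ip=source_ip,
            window=timedelta(minutes=5)
        )

        if failure_count > 10:
            # Potential brute force - block at network level
            await block_source(source_ip)
            await alert(
                severity="HIGH",
                message=f"NATS auth brute force from {source_ip}",
                action="auto_blocked"
            )

        # The stock advisory carries no subject; attempted_subject is assumed
        # to be added by an upstream enrichment step
        attempted_subject = event.get("attempted_subject")
        if attempted_subject in SENSITIVE_SUBJECTS:
            # Attempted access to security/credential topics
            await alert(
                severity="CRITICAL",
                message=f"Unauthorized access attempt to {attempted_subject}",
                source=source_ip
            )

    await nats_admin.subscribe(AUTH_ERROR_SUBJECT, cb=on_auth_failure)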
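The failure counter in the monitor is left undefined, so here is one possible shape for it: a per-source sliding window backed by a deque. This is a minimal in-process sketch under stated assumptions. `SlidingWindowCounter` is a hypothetical class, it is synchronous where the monitor awaits its helper, and a fleet with several monitor instances would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from datetime import timedelta


class SlidingWindowCounter:
    """Count events per key within a trailing time window."""

    def __init__(self, window: timedelta, clock=time.monotonic):
        self.window = window.total_seconds()
        self.clock = clock
        self.events = defaultdict(deque)

    def increment(self, key: str) -> int:
        """Record one event for `key` and return the count still inside the window."""
        now = self.clock()
        timestamps = self.events[key]
        timestamps.append(now)
        # Drop timestamps that have aged out of the window
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        return len(timestamps)


# Demonstration with a controllable clock instead of real time
fake_now = [0.0]
counter = SlidingWindowCounter(timedelta(minutes=5), clock=lambda: fake_now[0])

burst = 0
for _ in range(11):
    burst = counter.increment("10.0.0.5")

fake_now[0] = 301.0  # just past the five-minute window
after_window = counter.increment("10.0.0.5")

print(burst, after_window)
```

Injecting the clock keeps the counter testable without sleeping, and the same threshold comparison (`> 10`) can then be applied to the returned count.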

Results After Hardening

After implementing these changes across our resilient agent fleet:

  • Zero shared credentials: Each of 12+ agents has isolated NKey authentication
  • 24-hour rotation: All credentials rotate automatically every 24 hours
  • mTLS everywhere: No plaintext NATS traffic, TLS 1.3 enforced
  • Blast radius contained: Compromising one agent grants access only to that agent's scoped subjects
  • Full audit trail: Every authentication event logged for SOC 2 evidence

These patterns apply to any organization building multi-agent AI systems that rely on message-based coordination. The investment in proper NATS authentication pays dividends in security posture and compliance readiness.

For the broader credential management strategy across cloud providers, see our post on credential management for multi-cloud AI agents.

Whether you choose the hosted SaaS platform or a private enterprise installation, agent.ceo delivers the same autonomous workforce capabilities.

Try agent.ceo

SaaS — Get started with 1 free agent-week at agent.ceo.

Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.


agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo
