Kubernetes Orchestration for AI Agent Workloads
AI agents are not traditional web services. They don't handle HTTP requests in a request-response cycle. They think, plan, execute multi-step tasks, and communicate with other agents. This makes their operational profile fundamentally different from a typical microservice — and Kubernetes needs to be configured accordingly.
At agent.ceo, every agent runs as a Kubernetes pod. This isn't just convenience — it's architectural. Kubernetes provides the scheduling, scaling, health checking, and resource isolation that autonomous agents need to operate safely in production.
Why Kubernetes for AI Agents
The properties that make Kubernetes ideal for AI agent workloads:
- Isolation: Each agent runs in its own pod with defined resource limits
- Scheduling: Kubernetes places agents on appropriate nodes based on resource needs
- Health checking: Liveness and readiness probes ensure agents are functioning
- Scaling: Horizontal pod autoscaling adapts to workload demands
- RBAC: Fine-grained permissions control what each agent can access
- Networking: Network policies restrict agent-to-agent communication
- Observability: Standard metrics, logging, and tracing integration
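The RBAC point is worth making concrete. A minimal sketch of a Role and RoleBinding for a DevOps agent's ServiceAccount (names are illustrative, chosen to match the `agent-devops` ServiceAccount used in the Deployment below; the verbs are an assumption about what a read-mostly agent needs):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-devops
  namespace: agent-system
rules:
  # Read-only access to the workloads the agent observes
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-devops
  namespace: agent-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: agent-devops
subjects:
  - kind: ServiceAccount
    name: agent-devops
    namespace: agent-system
```

Scoping a Role per agent role means a compromised or misbehaving agent can only touch the resources its job requires.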
The Agent Pod Architecture
Here's how a typical agent.ceo agent is deployed:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-devops
  namespace: agent-system
  labels:
    agent.ceo/role: devops
    agent.ceo/tier: operations
spec:
  replicas: 1
  selector:
    matchLabels:
      app: agent-devops
  template:
    metadata:
      labels:
        app: agent-devops
        agent.ceo/role: devops
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: agent-devops
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: agent
          image: gcr.io/agent-ceo/agent-runtime:latest
          args: ["--role=devops", "--config=/etc/agent/config.yaml"]
          ports:
            - name: metrics
              containerPort: 9090
            - name: health
              containerPort: 8080
          env:
            - name: NATS_URL
              value: "nats://nats.agent-system.svc:4222"
            - name: AGENT_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: AGENT_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: health
            initialDelaySeconds: 30
            periodSeconds: 60
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: health
            initialDelaySeconds: 10
            periodSeconds: 30
          volumeMounts:
            - name: agent-config
              mountPath: /etc/agent
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: agent-config
          configMap:
            name: agent-devops-config
        - name: workspace
          emptyDir:
            sizeLimit: 10Gi
```
Resource Management for AI Workloads
AI agents have bursty resource profiles. During analysis and planning, they consume significant CPU and memory. During execution, they might be I/O bound waiting for API responses. This requires careful resource configuration:
```yaml
# Resource profiles for different agent types
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-resource-profiles
  namespace: agent-system
data:
  profiles.yaml: |
    profiles:
      # Agents that analyze code and make decisions
      cognitive:
        requests:
          memory: "2Gi"
          cpu: "1000m"
        limits:
          memory: "8Gi"
          cpu: "4000m"
        nodeSelector:
          agent.ceo/workload-type: cognitive
      # Agents that execute commands and API calls
      executor:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "2Gi"
          cpu: "1000m"
      # Agents that monitor and observe
      observer:
        requests:
          memory: "256Mi"
          cpu: "100m"
        limits:
          memory: "1Gi"
          cpu: "500m"
```
Scaling AI Agents
Unlike web services that scale on request count, AI agents scale based on task queue depth. Here's the HPA configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-devops-hpa
  namespace: agent-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-devops
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Scale on pending tasks in the NATS queue
    - type: External
      external:
        metric:
          name: nats_consumer_pending_messages
          selector:
            matchLabels:
              consumer: devops-agent
              stream: agent-tasks
        target:
          type: AverageValue
          averageValue: "3"  # Scale up when >3 pending tasks per replica
    # Also consider memory pressure
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
```
The custom metric comes from a NATS metrics exporter that exposes queue depths to Prometheus; the Prometheus adapter then surfaces it to the HPA through the external metrics API.
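The adapter side of that pipeline is a single rule. A sketch of a prometheus-adapter external-metric rule for the queue-depth metric (the series name and labels are assumptions carried over from the HPA above; the exact config schema depends on the adapter version deployed):

```yaml
# prometheus-adapter configuration fragment (sketch)
externalRules:
  - seriesQuery: 'nats_consumer_pending_messages{stream!="",consumer!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (consumer, stream)'
```

With this in place, `kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"` should list the metric, which is a quick way to debug an HPA that refuses to scale.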
Inter-Agent Communication via NATS
Agents communicate through NATS JetStream, which provides durable, at-least-once messaging between pods (with publish deduplication available when effectively-once delivery is needed):
```python
# Agent communication patterns in Kubernetes
import json
import uuid
from datetime import datetime, timezone

import nats  # nats-py client


class AgentCommunication:
    def __init__(self, agent_id, nats_url):
        self.agent_id = agent_id
        self.nats_url = nats_url
        self.nc = None
        self.js = None

    async def connect(self):
        self.nc = await nats.connect(
            self.nats_url,
            name=self.agent_id,
            max_reconnect_attempts=-1,  # Always reconnect
            reconnect_time_wait=2,
        )
        self.js = self.nc.jetstream()

    async def delegate_task(self, target_role, task):
        """Send a task to another agent by role."""
        await self.js.publish(
            f"tasks.{target_role}.new",
            json.dumps({
                "task_id": str(uuid.uuid4()),
                "from_agent": self.agent_id,
                "payload": task,
                "created_at": datetime.now(timezone.utc).isoformat(),
            }).encode(),
        )

    async def request_info(self, target_role, query, timeout=30):
        """Request-reply pattern for synchronous info exchange."""
        response = await self.nc.request(
            f"agents.{target_role}.query",
            json.dumps(query).encode(),
            timeout=timeout,
        )
        return json.loads(response.data)
```
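The task envelope above is plain JSON, so its construction and validation can be pulled into helpers that are testable without a broker. A sketch (field names match the `delegate_task` payload; the validation rule is an illustrative assumption):

```python
import json
import uuid
from datetime import datetime, timezone

# Fields every task envelope must carry, per the delegate_task payload.
REQUIRED_FIELDS = {"task_id", "from_agent", "payload", "created_at"}


def build_task_envelope(from_agent: str, payload: dict) -> bytes:
    """Serialize a task message in the format delegate_task publishes."""
    return json.dumps({
        "task_id": str(uuid.uuid4()),
        "from_agent": from_agent,
        "payload": payload,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }).encode()


def parse_task_envelope(data: bytes) -> dict:
    """Decode an incoming task message, rejecting malformed envelopes."""
    envelope = json.loads(data)
    missing = REQUIRED_FIELDS - envelope.keys()
    if missing:
        raise ValueError(f"task envelope missing fields: {sorted(missing)}")
    return envelope
```

Rejecting malformed envelopes at the subscriber boundary keeps a single bad publisher from wedging every consumer downstream.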
Network Policies for Agent Isolation
Not all agents should communicate with all other agents. Network policies enforce communication boundaries:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-devops-network
  namespace: agent-system
spec:
  podSelector:
    matchLabels:
      agent.ceo/role: devops
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow Prometheus to scrape agent metrics
    - from:
        - namespaceSelector:
            matchLabels:
              name: agent-system
      ports:
        - protocol: TCP
          port: 9090
  egress:
    # Allow connection to NATS
    - to:
        - podSelector:
            matchLabels:
              app: nats
      ports:
        - protocol: TCP
          port: 4222
    # Allow connection to the Kubernetes API
    - to:
        - ipBlock:
            cidr: 10.0.0.1/32  # K8s API server
      ports:
        - protocol: TCP
          port: 443
    # Allow connections to monitored services
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 80
```

Note that agents are NATS clients: inter-agent messages flow over the agent's outbound connection to the NATS server, so no ingress rule for NATS traffic is needed on the agent pods themselves.
Health Checking AI Agents
Traditional liveness probes check if a process is running. AI agents need more sophisticated health checks — is the agent actually making progress on tasks, or is it stuck?
```python
# Agent health endpoint
import time

import psutil
from fastapi import FastAPI, HTTPException

app = FastAPI()
agent = None  # set to the running agent instance at startup


@app.get("/healthz")
async def liveness():
    """Is the agent process alive and responsive?"""
    return {"status": "ok", "uptime": agent.uptime_seconds}


@app.get("/readyz")
async def readiness():
    """Is the agent ready to accept new tasks?"""
    checks = {
        "nats_connected": agent.nats.is_connected,
        "task_loop_running": agent.task_loop.is_alive(),
        "last_heartbeat_age": time.time() - agent.last_heartbeat,
        "memory_available": psutil.virtual_memory().available > 256_000_000,
    }
    healthy = all([
        checks["nats_connected"],
        checks["task_loop_running"],
        checks["last_heartbeat_age"] < 300,  # Less than 5 min since last heartbeat
        checks["memory_available"],
    ])
    if healthy:
        return {"status": "ready", "checks": checks}
    raise HTTPException(status_code=503, detail=checks)
```
Node Affinity and Agent Placement
Different agents have different resource profiles. Use node affinity to place cognitive-heavy agents on appropriate nodes:
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agent.ceo/workload-type
                operator: In
                values:
                  - cognitive
  # Spread agents across zones for resilience
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          agent.ceo/tier: operations
```
Observability Stack
Every agent exposes Prometheus metrics for real-time monitoring:
```yaml
# ServiceMonitor for agent metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agent-metrics
  namespace: agent-system
spec:
  selector:
    matchLabels:
      agent.ceo/monitored: "true"
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
Key metrics exposed by each agent:
```text
# Agent task metrics
agent_tasks_completed_total{role="devops", status="success"} 142
agent_tasks_completed_total{role="devops", status="failed"} 3
agent_task_duration_seconds_bucket{role="devops", le="60"} 89
agent_task_duration_seconds_bucket{role="devops", le="300"} 130
agent_task_duration_seconds_bucket{role="devops", le="3600"} 142

# Agent resource usage
agent_context_tokens_used{role="devops"} 45000
agent_decisions_made_total{role="devops", type="deploy"} 23
agent_decisions_made_total{role="devops", type="scale"} 8
agent_escalations_total{role="devops"} 2
```
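Dashboards and alerts build directly on these series. Two illustrative PromQL queries (sketches; the metric names match those above, but thresholds and windows are assumptions):

```text
# Task failure rate per role over the last hour
sum by (role) (rate(agent_tasks_completed_total{status="failed"}[1h]))
  /
sum by (role) (rate(agent_tasks_completed_total[1h]))

# 95th-percentile task duration per role
histogram_quantile(0.95,
  sum by (role, le) (rate(agent_task_duration_seconds_bucket[5m])))
```

A rising failure rate paired with a flat `agent_escalations_total` is often the most useful alert: it means agents are failing without asking for help.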
Building Resilient Agent Fleets
For production reliability, agents need to handle pod restarts, node failures, and network partitions gracefully. The combination of Kubernetes pod management, NATS durable subscriptions, and agent state persistence creates resilient agent fleets that recover automatically from infrastructure failures.
Learn more about the overall architecture and how to scale AI agents for growing workloads.
Try agent.ceo
SaaS — Get started with 1 free agent-week at agent.ceo.
Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.
agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo