A single AI agent crashing is an inconvenience. Your entire agent fleet going down because of a cascading failure is a catastrophe. When agents own real workflows — deploying code, managing infrastructure, handling security incidents — resilience is not optional. It is a core architectural requirement.
This guide covers the patterns, practices, and tooling required to build agent fleets that self-heal, degrade gracefully, and maintain organizational continuity even when individual agents fail.
Failure Modes in Agent Fleets
Rendering diagram…
AI agents fail differently from traditional services. Understanding the failure modes is the first step to handling them:
1. Agent Crash (Process Death)
The agent process terminates unexpectedly — OOM kill, node failure, or runtime panic.
Impact: In-flight task is interrupted. Messages in the agent's buffer are lost.
Mitigation: Durable NATS consumers with explicit acknowledgment ensure no messages are lost. Kubernetes restart policies bring the agent back.
# Kubernetes restart policy
spec:
containers:
- name: agent
resources:
limits:
memory: "4Gi" # Set limits to prevent OOM from affecting node
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 30
failureThreshold: 3
restartPolicy: Always
2. Agent Stuck (Infinite Loop or Deadlock)
The agent is alive but not making progress — stuck in a reasoning loop, waiting for a response that will never come, or endlessly retrying a failing operation.
Impact: Task never completes. Queue depth grows. Dependent agents block.
Mitigation: Task timeouts, progress watchdogs, and circuit breakers.
# Task timeout configuration
class TaskExecutor:
def __init__(self, max_duration=timedelta(minutes=30)):
self.max_duration = max_duration
self.last_progress = datetime.now()
self.progress_timeout = timedelta(minutes=5)
def execute(self, task):
start = datetime.now()
while not task.complete:
if datetime.now() - start > self.max_duration:
raise TaskTimeoutError(f"Task exceeded {self.max_duration}")
if datetime.now() - self.last_progress > self.progress_timeout:
raise StuckDetectedError("No progress in 5 minutes")
# Execute next step
step_result = self.agent.next_step(task)
self.last_progress = datetime.now()
# Publish progress
self.publish_progress(task, step_result)
3. Agent Producing Bad Output
The agent completes tasks but produces incorrect results — buggy code, wrong configurations, flawed security assessments.
Impact: Downstream agents and systems consume bad output. Errors compound.
Mitigation: Output validation, peer review gates, and rollback capabilities.
# Output validation pipeline
validation:
code_changes:
- check: "tests_pass"
required: true
on_failure: "reject_and_retry"
- check: "lint_clean"
required: true
on_failure: "auto_fix_and_revalidate"
- check: "security_scan"
required: true
on_failure: "escalate_to_cso"
deployments:
- check: "smoke_tests"
required: true
on_failure: "auto_rollback"
- check: "error_rate_baseline"
duration: "5m"
threshold: "< 1% increase"
on_failure: "auto_rollback"
4. Communication Failure
NATS becomes unreachable, messages are lost, or network partitions split the agent fleet.
Impact: Agents cannot coordinate. Tasks may be duplicated or dropped.
Mitigation: NATS clustering, client reconnection logic, and message deduplication.
// NATS connection with resilient reconnection
opts := []nats.Option{
nats.MaxReconnects(-1), // Unlimited reconnection attempts
nats.ReconnectWait(2 * time.Second), // Wait between attempts
nats.ReconnectBufSize(64 * 1024 * 1024), // 64MB buffer during disconnect
nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
log.Warn("NATS disconnected", "error", err)
metrics.NATSDisconnections.Inc()
}),
nats.ReconnectHandler(func(nc *nats.Conn) {
log.Info("NATS reconnected", "url", nc.ConnectedUrl())
metrics.NATSReconnections.Inc()
}),
}
nc, err := nats.Connect(natsURL, opts...)
5. Cascading Failure
One agent's failure causes dependent agents to fail, creating a chain reaction.
Impact: Fleet-wide outage from a single point of failure.
Mitigation: Circuit breakers, bulkheads, and graceful degradation.
Circuit Breakers for Agent Communication
When an agent repeatedly fails to respond or produces errors, stop sending it work:
type AgentCircuitBreaker struct {
agent string
failureCount int
failureThreshold int
state CircuitState // CLOSED, OPEN, HALF_OPEN
openedAt time.Time
recoveryTimeout time.Duration
}
func (cb *AgentCircuitBreaker) Call(task Task) (Result, error) {
switch cb.state {
case OPEN:
if time.Since(cb.openedAt) > cb.recoveryTimeout {
cb.state = HALF_OPEN
// Allow one test request through
} else {
return Result{}, ErrCircuitOpen
}
case HALF_OPEN:
result, err := cb.sendTask(task)
if err != nil {
cb.state = OPEN
cb.openedAt = time.Now()
return Result{}, err
}
cb.state = CLOSED
cb.failureCount = 0
return result, nil
case CLOSED:
result, err := cb.sendTask(task)
if err != nil {
cb.failureCount++
if cb.failureCount >= cb.failureThreshold {
cb.state = OPEN
cb.openedAt = time.Now()
log.Error("Circuit breaker opened for agent",
"agent", cb.agent,
"failures", cb.failureCount)
// Notify fleet manager
publishAlert(AgentCircuitOpen{Agent: cb.agent})
}
return Result{}, err
}
return result, nil
}
return Result{}, ErrUnknownState
}
Circuit Breaker Configuration per Agent
circuit_breakers:
backend-agent:
failure_threshold: 3
recovery_timeout: 5m
fallback: "queue_for_retry"
devops-agent:
failure_threshold: 2 # Lower threshold for critical agent
recovery_timeout: 2m
fallback: "escalate_to_human"
cso-agent:
failure_threshold: 3
recovery_timeout: 5m
fallback: "proceed_with_warning" # Don't block on security review failure
Health Checks and Watchdogs
Agent Health Check Layers
// Health check endpoint implementation
type HealthChecker struct {
agent *Agent
natsConn *nats.Conn
lastTaskTime time.Time
}
func (h *HealthChecker) LivenessCheck() HealthStatus {
// Am I alive? (process-level)
return HealthStatus{OK: true}
}
func (h *HealthChecker) ReadinessCheck() HealthStatus {
checks := []Check{
{"nats_connected", h.natsConn.IsConnected()},
{"model_available", h.agent.ModelReachable()},
{"workspace_writable", h.agent.WorkspaceWritable()},
{"memory_below_limit", h.agent.MemoryUsagePct() < 90},
}
allPassing := true
for _, check := range checks {
if !check.Pass {
allPassing = false
}
}
return HealthStatus{OK: allPassing, Checks: checks}
}
func (h *HealthChecker) TaskProgress() HealthStatus {
// Has the agent made progress recently?
stuckThreshold := 10 * time.Minute
if h.agent.HasActiveTask() && time.Since(h.lastTaskTime) > stuckThreshold {
return HealthStatus{
OK: false,
Message: "Agent appears stuck - no progress in 10 minutes",
}
}
return HealthStatus{OK: true}
}
Fleet-Level Watchdog
A dedicated watchdog monitors the entire fleet:
class FleetWatchdog:
"""Monitors agent fleet health and triggers corrective actions."""
def __init__(self, agents: list[str], nats_conn):
self.agents = agents
self.nc = nats_conn
self.alert_thresholds = {
"queue_depth": 10,
"stuck_duration": timedelta(minutes=15),
"error_rate": 0.2, # 20% task failure rate
}
async def check_fleet(self):
for agent in self.agents:
status = await self.get_agent_status(agent)
# Check queue depth
if status.queue_depth > self.alert_thresholds["queue_depth"]:
await self.trigger_scale_up(agent)
# Check for stuck agents
if status.time_since_progress > self.alert_thresholds["stuck_duration"]:
await self.restart_agent(agent)
# Check error rate
if status.recent_error_rate > self.alert_thresholds["error_rate"]:
await self.open_circuit_breaker(agent)
await self.notify_human(f"Agent {agent} error rate: {status.recent_error_rate:.0%}")
async def trigger_scale_up(self, agent: str):
"""Scale up agent replicas to handle queue growth."""
current = await self.get_replica_count(agent)
target = min(current + 1, self.max_replicas[agent])
await self.scale_agent(agent, target)
log.info(f"Scaled {agent} from {current} to {target} replicas")
Graceful Degradation
When parts of the fleet fail, the system should degrade gracefully rather than collapse entirely:
Degradation Hierarchy
degradation_levels:
level_0_nominal:
description: "All agents operational"
behavior: "Normal operations"
level_1_reduced:
description: "Some specialist agents unavailable"
behavior: "CTO agent handles specialist tasks directly"
triggers:
- "backend-agent circuit breaker open"
- "frontend-agent unresponsive > 5m"
level_2_essential:
description: "Only critical agents operational"
behavior: "Process only high-priority tasks, queue rest"
triggers:
- "Multiple agents unavailable"
- "NATS cluster degraded (< 3 nodes)"
level_3_emergency:
description: "Fleet severely impaired"
behavior: "Alert humans, pause non-critical work, protect data"
triggers:
- "CEO or CTO agent unavailable"
- "NATS completely unreachable"
- "> 50% of fleet in error state"
Implementing Fallback Chains
// Task routing with fallback
func routeTask(task Task) error {
// Primary: route to specialist
err := sendToAgent(task.PrimaryAgent, task)
if err == nil {
return nil
}
// Fallback 1: route to manager agent
log.Warn("Primary agent unavailable, using fallback",
"primary", task.PrimaryAgent, "fallback", task.FallbackAgent)
err = sendToAgent(task.FallbackAgent, task)
if err == nil {
return nil
}
// Fallback 2: queue for later processing
log.Warn("Fallback agent unavailable, queueing task",
"task", task.ID)
return queueForRetry(task, 5*time.Minute)
}
Self-Healing Patterns
Automatic Agent Restart on Failure
# Kubernetes will restart crashed agents
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: agent
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
failureThreshold: 3 # Restart after 3 failed checks (90s)
Automatic Task Retry
// NATS consumer with automatic redelivery
consumerConfig := &nats.ConsumerConfig{
Durable: "backend-tasks",
AckPolicy: nats.AckExplicitPolicy,
MaxDeliver: 3, // Try 3 times
AckWait: 30 * time.Minute, // Allow 30 min per attempt
BackOff: []time.Duration{ // Exponential backoff
1 * time.Minute,
5 * time.Minute,
15 * time.Minute,
},
}
Automatic Fleet Rebalancing
When an agent instance is lost, its pending tasks are automatically rebalanced:
async def rebalance_on_agent_loss(lost_agent: str):
"""When an agent instance dies, redistribute its unacked messages."""
# NATS JetStream handles this automatically for consumer groups!
# Unacknowledged messages are redelivered to surviving instances.
# But we should also handle the organizational impact:
# Notify the manager agent
await send_message(
to=lost_agent_manager,
type="notification",
payload={
"event": "agent_instance_lost",
"agent": lost_agent,
"pending_tasks": await get_pending_tasks(lost_agent),
"action_taken": "Tasks will be redelivered to surviving replicas",
"human_action_needed": len(get_replicas(lost_agent)) == 0
}
)
Monitoring and Alerting
Key Metrics for Agent Fleet Health
# Prometheus metrics exposed by each agent
metrics:
# Task metrics
- agent_tasks_received_total{agent, type}
- agent_tasks_completed_total{agent, type, status}
- agent_task_duration_seconds{agent, type} # Histogram
- agent_tasks_in_progress{agent}
# Communication metrics
- agent_messages_sent_total{agent, destination, type}
- agent_messages_received_total{agent, source, type}
- agent_message_latency_seconds{agent} # Histogram
# Health metrics
- agent_health_status{agent} # 1=healthy, 0=unhealthy
- agent_last_activity_timestamp{agent}
- agent_memory_usage_bytes{agent}
- agent_context_window_usage_pct{agent}
# Fleet metrics
- fleet_agents_healthy_total
- fleet_agents_unhealthy_total
- fleet_total_queue_depth
- fleet_circuit_breakers_open_total
Alert Rules
# Prometheus alert rules
groups:
- name: agent_fleet_alerts
rules:
- alert: AgentStuck
expr: time() - agent_last_activity_timestamp > 600
for: 5m
labels:
severity: warning
annotations:
summary: "Agent {{ $labels.agent }} has not made progress in 10+ minutes"
- alert: QueueDepthHigh
expr: fleet_total_queue_depth > 20
for: 10m
labels:
severity: warning
annotations:
summary: "Fleet queue depth is {{ $value }}, consider scaling"
- alert: AgentErrorRateHigh
expr: rate(agent_tasks_completed_total{status="failed"}[5m]) / rate(agent_tasks_completed_total[5m]) > 0.2
for: 5m
labels:
severity: critical
annotations:
summary: "Agent {{ $labels.agent }} failing > 20% of tasks"
- alert: FleetDegraded
expr: fleet_agents_unhealthy_total / (fleet_agents_healthy_total + fleet_agents_unhealthy_total) > 0.3
for: 2m
labels:
severity: critical
annotations:
summary: "More than 30% of agent fleet is unhealthy"
Chaos Engineering for Agent Fleets
Test resilience proactively by injecting failures:
# Chaos experiment: kill a random agent
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: agent-kill-test
spec:
action: pod-kill
mode: one
selector:
labelSelectors:
"app.kubernetes.io/component": "agent"
scheduler:
cron: "@weekly"
# Chaos experiment: network partition between agents and NATS
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: nats-partition-test
spec:
action: partition
mode: all
selector:
labelSelectors:
"app.kubernetes.io/component": "agent"
direction: both
target:
selector:
labelSelectors:
"app.kubernetes.io/name": "nats"
duration: "2m"
Resilience Checklist
Before deploying an agent fleet to production, verify:
- All NATS consumers use explicit acknowledgment
- Task timeouts are configured for every task type
- Circuit breakers protect inter-agent communication
- Dead letter queues capture repeatedly failed messages
- Health checks cover liveness, readiness, and progress
- Fleet watchdog monitors aggregate health
- Graceful degradation levels are defined and tested
- Alerts fire before users notice problems
- Chaos experiments validate recovery paths
- Rollback procedures exist for agent-initiated changes
For the infrastructure foundations that enable these resilience patterns, see NATS JetStream for AI Agent Communication and Agent Lifecycle Management. For the broader platform architecture, read Architecture of agent.ceo and AI-Powered DevOps.
Continue reading: Explore the architecture behind agent.ceo, learn about scaling AI agents to 100 concurrent workers, or get started with our 5-minute quickstart guide.
For more on agent communication, see NATS JetStream for AI agents and agent-to-agent messaging patterns.
Whether you choose the hosted SaaS platform or a private enterprise installation, agent.ceo delivers the same autonomous workforce capabilities.
Try agent.ceo
SaaS — Get started with 1 free agent-week at agent.ceo.
Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.
agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo