Skip to main content
Back to blog
Technical10 min read

Building Resilient AI Agent Fleets

M
Moshe Beeri, Founder
/
resiliencereliabilityfault-toleranceoperationsmonitoring

A single AI agent crashing is an inconvenience. Your entire agent fleet going down because of a cascading failure is a catastrophe. When agents own real workflows — deploying code, managing infrastructure, handling security incidents — resilience is not optional. It is a core architectural requirement.

This guide covers the patterns, practices, and tooling required to build agent fleets that self-heal, degrade gracefully, and maintain organizational continuity even when individual agents fail.

Failure Modes in Agent Fleets

Rendering diagram…

AI agents fail differently from traditional services. Understanding the failure modes is the first step to handling them:

1. Agent Crash (Process Death)

The agent process terminates unexpectedly — OOM kill, node failure, or runtime panic.

Impact: In-flight task is interrupted. Messages in the agent's buffer are lost.

Mitigation: Durable NATS consumers with explicit acknowledgment ensure no messages are lost. Kubernetes restart policies bring the agent back.

# Kubernetes restart policy
spec:
  containers:
    - name: agent
      resources:
        limits:
          memory: "4Gi"  # Set limits to prevent OOM from affecting node
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 30
        failureThreshold: 3
  restartPolicy: Always

2. Agent Stuck (Infinite Loop or Deadlock)

The agent is alive but not making progress — stuck in a reasoning loop, waiting for a response that will never come, or endlessly retrying a failing operation.

Impact: Task never completes. Queue depth grows. Dependent agents block.

Mitigation: Task timeouts, progress watchdogs, and circuit breakers.

# Task timeout configuration
class TaskExecutor:
    def __init__(self, max_duration=timedelta(minutes=30)):
        self.max_duration = max_duration
        self.last_progress = datetime.now()
        self.progress_timeout = timedelta(minutes=5)
    
    def execute(self, task):
        start = datetime.now()
        while not task.complete:
            if datetime.now() - start > self.max_duration:
                raise TaskTimeoutError(f"Task exceeded {self.max_duration}")
            if datetime.now() - self.last_progress > self.progress_timeout:
                raise StuckDetectedError("No progress in 5 minutes")
            
            # Execute next step
            step_result = self.agent.next_step(task)
            self.last_progress = datetime.now()
            
            # Publish progress
            self.publish_progress(task, step_result)

3. Agent Producing Bad Output

The agent completes tasks but produces incorrect results — buggy code, wrong configurations, flawed security assessments.

Impact: Downstream agents and systems consume bad output. Errors compound.

Mitigation: Output validation, peer review gates, and rollback capabilities.

# Output validation pipeline
validation:
  code_changes:
    - check: "tests_pass"
      required: true
      on_failure: "reject_and_retry"
    - check: "lint_clean"
      required: true
      on_failure: "auto_fix_and_revalidate"
    - check: "security_scan"
      required: true
      on_failure: "escalate_to_cso"
  deployments:
    - check: "smoke_tests"
      required: true
      on_failure: "auto_rollback"
    - check: "error_rate_baseline"
      duration: "5m"
      threshold: "< 1% increase"
      on_failure: "auto_rollback"

4. Communication Failure

NATS becomes unreachable, messages are lost, or network partitions split the agent fleet.

Impact: Agents cannot coordinate. Tasks may be duplicated or dropped.

Mitigation: NATS clustering, client reconnection logic, and message deduplication.

// NATS connection with resilient reconnection
opts := []nats.Option{
    nats.MaxReconnects(-1),              // Unlimited reconnection attempts
    nats.ReconnectWait(2 * time.Second), // Wait between attempts
    nats.ReconnectBufSize(64 * 1024 * 1024), // 64MB buffer during disconnect
    nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
        log.Warn("NATS disconnected", "error", err)
        metrics.NATSDisconnections.Inc()
    }),
    nats.ReconnectHandler(func(nc *nats.Conn) {
        log.Info("NATS reconnected", "url", nc.ConnectedUrl())
        metrics.NATSReconnections.Inc()
    }),
}

nc, err := nats.Connect(natsURL, opts...)

5. Cascading Failure

One agent's failure causes dependent agents to fail, creating a chain reaction.

Impact: Fleet-wide outage from a single point of failure.

Mitigation: Circuit breakers, bulkheads, and graceful degradation.

Circuit Breakers for Agent Communication

When an agent repeatedly fails to respond or produces errors, stop sending it work:

type AgentCircuitBreaker struct {
    agent           string
    failureCount    int
    failureThreshold int
    state           CircuitState  // CLOSED, OPEN, HALF_OPEN
    openedAt        time.Time
    recoveryTimeout time.Duration
}

func (cb *AgentCircuitBreaker) Call(task Task) (Result, error) {
    switch cb.state {
    case OPEN:
        if time.Since(cb.openedAt) > cb.recoveryTimeout {
            cb.state = HALF_OPEN
            // Allow one test request through
        } else {
            return Result{}, ErrCircuitOpen
        }
    case HALF_OPEN:
        result, err := cb.sendTask(task)
        if err != nil {
            cb.state = OPEN
            cb.openedAt = time.Now()
            return Result{}, err
        }
        cb.state = CLOSED
        cb.failureCount = 0
        return result, nil
    case CLOSED:
        result, err := cb.sendTask(task)
        if err != nil {
            cb.failureCount++
            if cb.failureCount >= cb.failureThreshold {
                cb.state = OPEN
                cb.openedAt = time.Now()
                log.Error("Circuit breaker opened for agent",
                    "agent", cb.agent,
                    "failures", cb.failureCount)
                // Notify fleet manager
                publishAlert(AgentCircuitOpen{Agent: cb.agent})
            }
            return Result{}, err
        }
        return result, nil
    }
    return Result{}, ErrUnknownState
}

Circuit Breaker Configuration per Agent

circuit_breakers:
  backend-agent:
    failure_threshold: 3
    recovery_timeout: 5m
    fallback: "queue_for_retry"
  
  devops-agent:
    failure_threshold: 2      # Lower threshold for critical agent
    recovery_timeout: 2m
    fallback: "escalate_to_human"
  
  cso-agent:
    failure_threshold: 3
    recovery_timeout: 5m
    fallback: "proceed_with_warning"  # Don't block on security review failure

Health Checks and Watchdogs

Agent Health Check Layers

// Health check endpoint implementation
type HealthChecker struct {
    agent        *Agent
    natsConn     *nats.Conn
    lastTaskTime time.Time
}

func (h *HealthChecker) LivenessCheck() HealthStatus {
    // Am I alive? (process-level)
    return HealthStatus{OK: true}
}

func (h *HealthChecker) ReadinessCheck() HealthStatus {
    checks := []Check{
        {"nats_connected", h.natsConn.IsConnected()},
        {"model_available", h.agent.ModelReachable()},
        {"workspace_writable", h.agent.WorkspaceWritable()},
        {"memory_below_limit", h.agent.MemoryUsagePct() < 90},
    }
    
    allPassing := true
    for _, check := range checks {
        if !check.Pass {
            allPassing = false
        }
    }
    
    return HealthStatus{OK: allPassing, Checks: checks}
}

func (h *HealthChecker) TaskProgress() HealthStatus {
    // Has the agent made progress recently?
    stuckThreshold := 10 * time.Minute
    if h.agent.HasActiveTask() && time.Since(h.lastTaskTime) > stuckThreshold {
        return HealthStatus{
            OK:      false,
            Message: "Agent appears stuck - no progress in 10 minutes",
        }
    }
    return HealthStatus{OK: true}
}

Fleet-Level Watchdog

A dedicated watchdog monitors the entire fleet:

class FleetWatchdog:
    """Monitors agent fleet health and triggers corrective actions."""
    
    def __init__(self, agents: list[str], nats_conn):
        self.agents = agents
        self.nc = nats_conn
        self.alert_thresholds = {
            "queue_depth": 10,
            "stuck_duration": timedelta(minutes=15),
            "error_rate": 0.2,  # 20% task failure rate
        }
    
    async def check_fleet(self):
        for agent in self.agents:
            status = await self.get_agent_status(agent)
            
            # Check queue depth
            if status.queue_depth > self.alert_thresholds["queue_depth"]:
                await self.trigger_scale_up(agent)
            
            # Check for stuck agents
            if status.time_since_progress > self.alert_thresholds["stuck_duration"]:
                await self.restart_agent(agent)
            
            # Check error rate
            if status.recent_error_rate > self.alert_thresholds["error_rate"]:
                await self.open_circuit_breaker(agent)
                await self.notify_human(f"Agent {agent} error rate: {status.recent_error_rate:.0%}")
    
    async def trigger_scale_up(self, agent: str):
        """Scale up agent replicas to handle queue growth."""
        current = await self.get_replica_count(agent)
        target = min(current + 1, self.max_replicas[agent])
        await self.scale_agent(agent, target)
        log.info(f"Scaled {agent} from {current} to {target} replicas")

Graceful Degradation

When parts of the fleet fail, the system should degrade gracefully rather than collapse entirely:

Degradation Hierarchy

degradation_levels:
  level_0_nominal:
    description: "All agents operational"
    behavior: "Normal operations"
  
  level_1_reduced:
    description: "Some specialist agents unavailable"
    behavior: "CTO agent handles specialist tasks directly"
    triggers:
      - "backend-agent circuit breaker open"
      - "frontend-agent unresponsive > 5m"
  
  level_2_essential:
    description: "Only critical agents operational"
    behavior: "Process only high-priority tasks, queue rest"
    triggers:
      - "Multiple agents unavailable"
      - "NATS cluster degraded (< 3 nodes)"
  
  level_3_emergency:
    description: "Fleet severely impaired"
    behavior: "Alert humans, pause non-critical work, protect data"
    triggers:
      - "CEO or CTO agent unavailable"
      - "NATS completely unreachable"
      - "> 50% of fleet in error state"

Implementing Fallback Chains

// Task routing with fallback
func routeTask(task Task) error {
    // Primary: route to specialist
    err := sendToAgent(task.PrimaryAgent, task)
    if err == nil {
        return nil
    }
    
    // Fallback 1: route to manager agent
    log.Warn("Primary agent unavailable, using fallback",
        "primary", task.PrimaryAgent, "fallback", task.FallbackAgent)
    err = sendToAgent(task.FallbackAgent, task)
    if err == nil {
        return nil
    }
    
    // Fallback 2: queue for later processing
    log.Warn("Fallback agent unavailable, queueing task",
        "task", task.ID)
    return queueForRetry(task, 5*time.Minute)
}

Self-Healing Patterns

Automatic Agent Restart on Failure

# Kubernetes will restart crashed agents
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: agent
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
            failureThreshold: 3  # Restart after 3 failed checks (90s)

Automatic Task Retry

// NATS consumer with automatic redelivery
consumerConfig := &nats.ConsumerConfig{
    Durable:        "backend-tasks",
    AckPolicy:      nats.AckExplicitPolicy,
    MaxDeliver:     3,                      // Try 3 times
    AckWait:        30 * time.Minute,       // Allow 30 min per attempt
    BackOff:        []time.Duration{        // Exponential backoff
        1 * time.Minute,
        5 * time.Minute,
        15 * time.Minute,
    },
}

Automatic Fleet Rebalancing

When an agent instance is lost, its pending tasks are automatically rebalanced:

async def rebalance_on_agent_loss(lost_agent: str):
    """When an agent instance dies, redistribute its unacked messages."""
    
    # NATS JetStream handles this automatically for consumer groups!
    # Unacknowledged messages are redelivered to surviving instances.
    
    # But we should also handle the organizational impact:
    # Notify the manager agent
    await send_message(
        to=lost_agent_manager,
        type="notification",
        payload={
            "event": "agent_instance_lost",
            "agent": lost_agent,
            "pending_tasks": await get_pending_tasks(lost_agent),
            "action_taken": "Tasks will be redelivered to surviving replicas",
            "human_action_needed": len(get_replicas(lost_agent)) == 0
        }
    )

Monitoring and Alerting

Key Metrics for Agent Fleet Health

# Prometheus metrics exposed by each agent
metrics:
  # Task metrics
  - agent_tasks_received_total{agent, type}
  - agent_tasks_completed_total{agent, type, status}
  - agent_task_duration_seconds{agent, type}  # Histogram
  - agent_tasks_in_progress{agent}
  
  # Communication metrics
  - agent_messages_sent_total{agent, destination, type}
  - agent_messages_received_total{agent, source, type}
  - agent_message_latency_seconds{agent}  # Histogram
  
  # Health metrics
  - agent_health_status{agent}  # 1=healthy, 0=unhealthy
  - agent_last_activity_timestamp{agent}
  - agent_memory_usage_bytes{agent}
  - agent_context_window_usage_pct{agent}
  
  # Fleet metrics
  - fleet_agents_healthy_total
  - fleet_agents_unhealthy_total
  - fleet_total_queue_depth
  - fleet_circuit_breakers_open_total

Alert Rules

# Prometheus alert rules
groups:
  - name: agent_fleet_alerts
    rules:
      - alert: AgentStuck
        expr: time() - agent_last_activity_timestamp > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent }} has not made progress in 10+ minutes"
      
      - alert: QueueDepthHigh
        expr: fleet_total_queue_depth > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Fleet queue depth is {{ $value }}, consider scaling"
      
      - alert: AgentErrorRateHigh
        expr: rate(agent_tasks_completed_total{status="failed"}[5m]) / rate(agent_tasks_completed_total[5m]) > 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} failing > 20% of tasks"
      
      - alert: FleetDegraded
        expr: fleet_agents_unhealthy_total / (fleet_agents_healthy_total + fleet_agents_unhealthy_total) > 0.3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "More than 30% of agent fleet is unhealthy"

Chaos Engineering for Agent Fleets

Test resilience proactively by injecting failures:

# Chaos experiment: kill a random agent
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: agent-kill-test
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "agent"
  scheduler:
    cron: "@weekly"

# Chaos experiment: network partition between agents and NATS
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: nats-partition-test
spec:
  action: partition
  mode: all
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "agent"
  direction: both
  target:
    selector:
      labelSelectors:
        "app.kubernetes.io/name": "nats"
  duration: "2m"

Resilience Checklist

Before deploying an agent fleet to production, verify:

  • All NATS consumers use explicit acknowledgment
  • Task timeouts are configured for every task type
  • Circuit breakers protect inter-agent communication
  • Dead letter queues capture repeatedly failed messages
  • Health checks cover liveness, readiness, and progress
  • Fleet watchdog monitors aggregate health
  • Graceful degradation levels are defined and tested
  • Alerts fire before users notice problems
  • Chaos experiments validate recovery paths
  • Rollback procedures exist for agent-initiated changes

For the infrastructure foundations that enable these resilience patterns, see NATS JetStream for AI Agent Communication and Agent Lifecycle Management. For the broader platform architecture, read Architecture of agent.ceo and AI-Powered DevOps.

Continue reading: Explore the architecture behind agent.ceo, learn about scaling AI agents to 100 concurrent workers, or get started with our 5-minute quickstart guide.

For more on agent communication, see NATS JetStream for AI agents and agent-to-agent messaging patterns.

Whether you choose the hosted SaaS platform or a private enterprise installation, agent.ceo delivers the same autonomous workforce capabilities.

Try agent.ceo

SaaS — Get started with 1 free agent-week at agent.ceo.

Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.


agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo

Related articles