Technical | 9 min read

Monitoring Your AI Agent Fleet

Moshe Beeri, Founder
monitoring, observability, dashboard, metrics, costs, performance, tutorial

What to Monitor

Running autonomous AI agents without monitoring is like deploying servers without observability. You need visibility into what your agents are doing, how they are performing, what they are costing, and when they need attention. This guide covers setting up comprehensive monitoring for your agent.ceo fleet.

By the end of this tutorial, you will have a fully configured monitoring stack with dashboards, alerts, and cost tracking for all your agents.

Effective agent monitoring covers four dimensions:

  1. Health: Is the agent running and responsive?
  2. Performance: How quickly and accurately does it complete tasks?
  3. Costs: How much is each agent costing in compute and API usage?
  4. Impact: What value is the agent delivering to your team?

Prerequisites

  • At least one deployed agent (see Getting Started)
  • Access to the agent.ceo dashboard
  • Slack connected for alert notifications (recommended)

Step 1: Access the Monitoring Dashboard

Navigate to Dashboard > Fleet Overview in your agent.ceo account:

# Or via CLI
agentceo fleet status

# Output:
# Fleet Overview
# ==============
# Total Agents: 5
# Status: 4 Active, 1 Idle
# Tasks Today: 47 completed, 3 in progress, 1 blocked
# Cost Today: $12.34
#
# Agent           Role              Status    Tasks    Cost
# ─────────────────────────────────────────────────────────
# CodeReviewer    code-reviewer     Active    23       $4.12
# SecurityBot     security-analyst  Active    8        $3.45
# DevOpsAgent     devops-engineer   Active    12       $2.89
# ReleaseBot      release-manager   Idle      0        $0.00
# BackendDev      backend-developer Active    7        $1.88

Step 2: Configure Health Monitoring

Set up health checks that detect when agents are struggling:

# health-monitoring.yaml
health_monitoring:
  checks:
    heartbeat:
      interval: "30s"
      timeout: "10s"
      failure_threshold: 3
      action: "restart-agent"
    
    task_queue:
      max_queue_depth: 10
      max_wait_time: "5m"
      action: "alert-and-scale"
    
    memory_usage:
      warning_threshold: "80%"
      critical_threshold: "95%"
      action: "alert"
    
    error_rate:
      window: "5m"
      warning_threshold: "10%"
      critical_threshold: "25%"
      action: "alert-and-pause"

  alerts:
    channels:
      - slack: "#agent-alerts"
      - email: "oncall@company.com"
    
    rules:
      - name: "Agent Down"
        condition: "heartbeat.failed >= 3"
        severity: critical
        message: "Agent {agent_name} is unresponsive"
      
      - name: "High Error Rate"
        condition: "error_rate > 25% for 5m"
        severity: high
        message: "Agent {agent_name} error rate at {error_rate}%"
      
      - name: "Queue Backlog"
        condition: "queue_depth > 10"
        severity: medium
        message: "Agent {agent_name} has {queue_depth} pending tasks"

# Apply health monitoring configuration
agentceo monitoring configure \
  --config health-monitoring.yaml
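
The heartbeat check above restarts an agent after three failed heartbeats. The sketch below shows one way that rule could behave, assuming failures are counted consecutively and that any successful heartbeat resets the count; the source does not say how agent.ceo counts them. The `HeartbeatTracker` class and the `"ok"` and `"degraded"` return values are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTracker:
    """Counts consecutive missed heartbeats for one agent (illustrative only)."""

    failure_threshold: int = 3      # mirrors failure_threshold in health-monitoring.yaml
    consecutive_failures: int = 0

    def record(self, responded_in_time: bool) -> str:
        """Record one heartbeat result and return the action to take."""
        if responded_in_time:
            # Assumption: a single successful heartbeat resets the streak.
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Same action name as the heartbeat check in the YAML above.
            return "restart-agent"
        return "degraded"
```

Counting consecutive misses rather than total misses keeps a single slow response from triggering a restart, which is the usual reason a failure threshold exists.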

Step 3: Track Performance Metrics

Monitor how effectively your agents complete their work:

# performance-metrics.yaml
performance_metrics:
  task_metrics:
    - name: "task_completion_time"
      description: "Time from task assignment to completion"
      aggregations: [p50, p90, p99, avg]
      alert_threshold:
        p90: "10m"  # Alert if 90th percentile exceeds 10 minutes
    
    - name: "task_success_rate"
      description: "Percentage of tasks completed without error"
      aggregations: [hourly, daily, weekly]
      alert_threshold:
        daily: "95%"  # Alert if success rate drops below 95%
    
    - name: "human_override_rate"
      description: "Percentage of agent decisions overridden by humans"
      aggregations: [daily, weekly]
      alert_threshold:
        weekly: "15%"  # Alert if override rate exceeds 15%
    
    - name: "escalation_rate"
      description: "Percentage of tasks escalated to humans"
      aggregations: [daily, weekly]
      target: "< 10%"

  code_review_metrics:
    - name: "review_response_time"
      description: "Time from PR opened to first review comment"
      target: "< 2m"
    
    - name: "review_accuracy"
      description: "Percentage of review comments accepted by PR author"
      target: "> 80%"
    
    - name: "issues_caught"
      description: "Bugs/vulnerabilities detected before merge"
      track: [count, severity, category]

# View performance metrics
agentceo metrics show --agent CodeReviewer --period 7d

# Output:
# CodeReviewer - Performance (Last 7 Days)
# =========================================
# Task Completion Time (p50): 45s
# Task Completion Time (p90): 2m 12s
# Task Success Rate: 98.2%
# Human Override Rate: 4.1%
# Review Response Time: 38s
# Issues Caught: 12 (3 high, 6 medium, 3 low)
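
To make the p50/p90/p99 aggregations concrete, here is a minimal sketch of a percentile check over task completion times. It uses the nearest-rank definition, which is an assumption: the source does not say which percentile method agent.ceo uses. The function names are hypothetical, and the 600-second default mirrors the `p90: "10m"` alert threshold from the config.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def p90_breached(durations_s, threshold_s=600):
    """True when the 90th percentile completion time exceeds the threshold (10 minutes by default)."""
    return percentile(durations_s, 90) > threshold_s
```

A percentile threshold is less sensitive to one very slow task than a maximum, and less forgiving of a slow tail than an average, which is why p90 is a common choice for latency alerts.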

Step 4: Set Up Cost Monitoring

Track and control agent spending:

# cost-monitoring.yaml
cost_monitoring:
  tracking:
    granularity: "per-task"  # Track cost per task, per agent, per team
    breakdown:
      - compute_costs     # CPU and memory
      - api_costs         # Claude API usage
      - storage_costs     # Workspace storage
      - network_costs     # Data transfer
  
  budgets:
    team_daily:
      limit: "$50"
      warning_at: "80%"  # Alert at $40
      action_at: "100%"  # Pause non-critical agents at $50
    
    agent_daily:
      CodeReviewer: "$15"
      SecurityBot: "$10"
      DevOpsAgent: "$10"
      BackendDev: "$15"
    
    monthly_total:
      limit: "$1000"
      warning_at: "70%"
      action_at: "90%"
  
  optimization:
    idle_detection:
      threshold: "15m"
      action: "scale-to-zero"
    
    batch_processing:
      enabled: true
      batch_window: "5m"
      # Group similar tasks to reduce cold starts
    
    spot_instances:
      enabled: true
      fallback_to_on_demand: true
      # Use spot for non-urgent tasks

# View cost breakdown
agentceo costs show --period 30d --breakdown agent

# Output:
# Cost Report (Last 30 Days)
# ==========================
# Total: $487.23
#
# Agent           Compute   API       Storage   Total
# ─────────────────────────────────────────────────────
# CodeReviewer    $42.00    $112.45   $2.10     $156.55
# SecurityBot     $38.00    $78.90    $5.20     $122.10
# DevOpsAgent     $35.00    $52.30    $1.80     $89.10
# BackendDev      $40.00    $65.48    $4.00     $109.48
# ReleaseBot      $5.00     $3.50     $1.50     $10.00
#
# Savings from optimization: $82.40 (spot instances, idle detection)
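
The budget thresholds in cost-monitoring.yaml reduce to simple arithmetic: a warning fires at a percentage of the limit, and an action fires at a higher percentage. The sketch below assumes the thresholds are inclusive (spending exactly $40 of a $50 budget with `warning_at: "80%"` triggers the warning); the helper names and the `"ok"`, `"warning"`, and `"action"` return values are hypothetical.

```python
from decimal import Decimal


def parse_money(text):
    """Convert a string such as "$50" to a Decimal amount."""
    return Decimal(text.lstrip("$"))


def parse_percent(text):
    """Convert a string such as "80%" to a Decimal fraction (0.8)."""
    return Decimal(text.rstrip("%")) / 100


def budget_status(spent, limit="$50", warning_at="80%", action_at="100%"):
    """Classify current spend against a budget; defaults mirror team_daily above."""
    limit_amount = parse_money(limit)
    spent_amount = Decimal(str(spent))
    # Check the stricter threshold first so "action" takes precedence over "warning".
    if spent_amount >= limit_amount * parse_percent(action_at):
        return "action"
    if spent_amount >= limit_amount * parse_percent(warning_at):
        return "warning"
    return "ok"
```

For example, `budget_status("487.23", "$1000", "70%", "90%")` evaluates the 30-day total from the report against the monthly budget. `Decimal` is used instead of `float` so that currency comparisons at the exact threshold are not affected by binary rounding.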

For advanced cost optimization strategies, see our dedicated guide on cost optimization for AI agents.

Step 5: Create Custom Dashboards

Build dashboards tailored to different audiences:

# dashboards.yaml
dashboards:
  - name: "Engineering Lead"
    widgets:
      - type: "fleet-status"
        position: [0, 0, 6, 2]
      - type: "tasks-completed-today"
        position: [6, 0, 6, 2]
      - type: "cost-burn-rate"
        position: [0, 2, 4, 2]
      - type: "top-issues-caught"
        position: [4, 2, 4, 2]
      - type: "human-override-log"
        position: [8, 2, 4, 2]
  
  - name: "DevOps"
    widgets:
      - type: "pod-health"
        position: [0, 0, 6, 2]
      - type: "resource-utilization"
        position: [6, 0, 6, 2]
      - type: "error-log"
        position: [0, 2, 12, 3]
  
  - name: "Executive"
    widgets:
      - type: "monthly-cost-summary"
        position: [0, 0, 4, 2]
      - type: "productivity-impact"
        position: [4, 0, 4, 2]
      - type: "roi-calculator"
        position: [8, 0, 4, 2]

# Create a custom dashboard
agentceo dashboard create "Engineering Lead" \
  --config dashboards.yaml

# Share dashboard with team
agentceo dashboard share "Engineering Lead" \
  --team "Engineering"
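
Before applying a layout, it can help to check that no widgets collide. The sketch below assumes each `position` is `[x, y, width, height]` on a 12-column grid. That reading is consistent with the values in dashboards.yaml (every row's widths sum to 12) but is not confirmed by the source. The function names are hypothetical.

```python
from itertools import combinations

GRID_COLUMNS = 12  # assumption inferred from the example layouts


def overlaps(a, b):
    """True when two [x, y, width, height] rectangles share any grid cell."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def layout_problems(widgets):
    """Return a list of human-readable problems for a {widget_type: position} mapping."""
    problems = []
    for name, (x, y, w, h) in widgets.items():
        if x < 0 or y < 0 or w <= 0 or h <= 0 or x + w > GRID_COLUMNS:
            problems.append(f"{name}: outside the {GRID_COLUMNS}-column grid")
    for (name_a, pos_a), (name_b, pos_b) in combinations(widgets.items(), 2):
        if overlaps(pos_a, pos_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems
```

Running `layout_problems` on the "Engineering Lead" widgets returns an empty list under these assumptions, since the five widgets tile two rows without touching.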

Step 6: Configure Alert Routing

Set up intelligent alerting that avoids noise:

# alert-routing.yaml
alert_routing:
  # Deduplicate similar alerts
  deduplication:
    window: "15m"
    group_by: [agent_name, alert_type]
  
  # Route alerts based on severity and type
  routes:
    - match:
        severity: critical
      notify:
        - slack: "#agent-alerts"
        - pagerduty: "agent-fleet"
        - sms: "+1234567890"
      repeat_interval: "5m"
    
    - match:
        severity: high
      notify:
        - slack: "#agent-alerts"
        - email: "engineering-leads@company.com"
      repeat_interval: "30m"
    
    - match:
        severity: medium
      notify:
        - slack: "#agent-status"
      repeat_interval: "2h"
    
    - match:
        severity: low
        type: informational
      notify:
        - dashboard-only: true
      repeat_interval: "24h"
  
  # Quiet hours (suppress non-critical alerts)
  quiet_hours:
    schedule: "22:00-08:00 UTC on weekdays, all day weekends"
    suppress_below: "high"
    buffer_and_send: "08:00 UTC"

# Apply alert routing
agentceo alerts configure --config alert-routing.yaml

# Test alert routing
agentceo alerts test --severity high --message "Test alert"
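
The deduplication settings above (`window: "15m"`, `group_by: [agent_name, alert_type]`) can be read as: suppress an alert if another alert with the same agent and type was already sent within the last 15 minutes. The sketch below implements that reading. It is an assumption about the semantics, not agent.ceo's actual implementation, and `AlertDeduplicator` is a hypothetical name.

```python
from datetime import datetime, timedelta


class AlertDeduplicator:
    """Suppresses repeat alerts that share the same group key within a time window."""

    def __init__(self, window=timedelta(minutes=15),
                 group_by=("agent_name", "alert_type")):
        self.window = window
        self.group_by = group_by
        self._last_sent = {}  # group key -> time the last alert was actually sent

    def should_send(self, alert: dict, now: datetime) -> bool:
        """Return True if the alert should be delivered, False if it is a duplicate."""
        key = tuple(alert.get(field) for field in self.group_by)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # a matching alert went out less than one window ago
        self._last_sent[key] = now
        return True
```

In this sketch the window is measured from the last delivered alert, so a continuously firing condition still produces one notification per window instead of going silent.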

Step 7: Set Up Activity Logging

Maintain detailed logs of all agent actions for auditing and debugging:

# logging-config.yaml
logging:
  # What to log
  capture:
    - task-assignments
    - task-completions
    - tool-invocations
    - git-operations
    - api-calls
    - decisions-made
    - errors-encountered
    - escalations
  
  # Retention
  retention:
    hot_storage: "7 days"    # Fast query, full detail
    warm_storage: "30 days"  # Slower query, full detail
    cold_storage: "1 year"   # Archive, summary only
  
  # Export integrations
  export:
    datadog:
      enabled: true
      api_key_ref: "secrets/datadog-api-key"
      tags: ["service:agent-ceo", "env:production"]
    
    elasticsearch:
      enabled: true
      url: "https://elastic.company.com:9200"
      index_prefix: "agent-ceo-logs"

# Query agent activity logs
agentceo logs query \
  --agent CodeReviewer \
  --action "git-operations" \
  --since "24h" \
  --limit 50

# Export logs for analysis
agentceo logs export \
  --format json \
  --since "7d" \
  --output agent-logs-week.json
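
The retention tiers determine how much detail is available for a log entry of a given age, which matters when planning an audit or a post-incident review. The sketch below assumes the three durations are cumulative ages (entries move to warm storage after 7 days and to cold storage after 30 days), that "1 year" means 365 days, and that entries past the last tier are gone. None of this is stated in the source, and `storage_tier` is a hypothetical helper.

```python
from datetime import timedelta

# Upper age bound for each tier, mirroring the retention block in logging-config.yaml.
RETENTION_TIERS = [
    ("hot", timedelta(days=7)),      # fast query, full detail
    ("warm", timedelta(days=30)),    # slower query, full detail
    ("cold", timedelta(days=365)),   # archive, summary only
]


def storage_tier(age: timedelta) -> str:
    """Return the tier that would hold a log entry of the given age."""
    for name, max_age in RETENTION_TIERS:
        if age <= max_age:
            return name
    return "expired"  # assumption: entries older than the last tier are deleted
```

Under these assumptions, anything needed in full detail has to be queried or exported within 30 days, because cold storage keeps summaries only.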

Step 8: Monitor Agent Collaboration

Track how agents work together within teams:

# View collaboration metrics
agentceo team metrics "Core Engineering" --type collaboration

# Output:
# Team Collaboration Metrics (Last 7 Days)
# =========================================
# Inter-agent messages: 34
# Coordinated tasks: 8
# Handoffs: 12 (success rate: 100%)
# Shared context references: 45
# Conflict resolutions: 2 (both resolved autonomously)
#
# Most active collaboration:
# SecurityBot -> CodeReviewer: 15 interactions
# (Security findings triggering review blocks)

Monitoring Best Practices

  1. Start with defaults: agent.ceo provides sensible default monitoring. Customize only after you understand baseline behavior.

  2. Alert on symptoms, not causes: Alert when task completion rate drops, not when CPU hits 80%. The former requires action; the latter might be normal.

  3. Review weekly: Schedule a 10-minute weekly review of agent metrics. Look for trends, not individual events.

  4. Correlate with team productivity: Track whether agent deployment correlates with faster PR merge times, fewer production incidents, or reduced toil.

  5. Set cost budgets early: It is easier to increase budgets than to explain unexpected bills.

  6. Use dashboards for different audiences: Engineers need technical detail; executives need impact and ROI.

For deeper integration with enterprise monitoring tools, see our guide on real-time agent monitoring. To understand how monitoring fits into the broader platform, explore the agent.ceo architecture.

Integrating with Existing Observability

agent.ceo exports metrics in standard formats compatible with:

  • Prometheus/Grafana: Scrape metrics endpoint at /metrics
  • Datadog: Native integration via API key
  • New Relic: OpenTelemetry export
  • Splunk: HEC (HTTP Event Collector) integration
  • CloudWatch/Cloud Monitoring: Cloud-native integrations

# Enable Prometheus metrics endpoint
agentceo monitoring enable prometheus \
  --port 9090 \
  --path /metrics

# Enable OpenTelemetry export
agentceo monitoring enable otlp \
  --endpoint "https://otel-collector.company.com:4317"
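
For readers unfamiliar with what a Prometheus scrape returns, the sketch below renders a gauge in the Prometheus text exposition format. The metric name `agentceo_task_success_ratio` and the `agent` label are hypothetical; the source does not list the metric names that agent.ceo actually exposes at /metrics.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines as the exposition format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"
```

Each sample line carries the metric name, a set of labels, and a numeric value, so a per-agent breakdown in Grafana would come from filtering or grouping on a label such as the hypothetical `agent` label used here.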

GenBrain AI is the company behind agent.ceo, building the next generation of autonomous agent orchestration.

Try agent.ceo

SaaS — Get started with 1 free agent-week at agent.ceo.

Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.


