Running autonomous AI agents without monitoring is like deploying servers without observability. You need visibility into what your agents are doing, how they are performing, what they are costing, and when they need attention. This guide covers setting up comprehensive monitoring for your agent.ceo fleet.
By the end of this tutorial, you will have a fully configured monitoring stack with dashboards, alerts, and cost tracking for all your agents.
What to Monitor
Effective agent monitoring covers four dimensions:
- Health: Is the agent running and responsive?
- Performance: How quickly and accurately does it complete tasks?
- Costs: How much is each agent costing in compute and API usage?
- Impact: What value is the agent delivering to your team?
Prerequisites
- At least one deployed agent (see Getting Started)
- Access to the agent.ceo dashboard
- Slack connected for alert notifications (recommended)
Step 1: Access the Monitoring Dashboard
Navigate to Dashboard > Fleet Overview in your agent.ceo account:
# Or via CLI
agentceo fleet status
# Output:
# Fleet Overview
# ==============
# Total Agents: 5
# Status: 4 Active, 1 Idle
# Tasks Today: 47 completed, 3 in progress, 1 blocked
# Cost Today: $12.34
#
# Agent         Role               Status   Tasks   Cost
# ─────────────────────────────────────────────────────────
# CodeReviewer  code-reviewer      Active   23      $4.12
# SecurityBot   security-analyst   Active   8       $3.45
# DevOpsAgent   devops-engineer    Active   12      $2.89
# ReleaseBot    release-manager    Idle     0       $0.00
# BackendDev    backend-developer  Active   7       $1.88
Step 2: Configure Health Monitoring
Set up health checks that detect when agents are struggling:
# health-monitoring.yaml
health_monitoring:
  checks:
    heartbeat:
      interval: "30s"
      timeout: "10s"
      failure_threshold: 3
      action: "restart-agent"
    task_queue:
      max_queue_depth: 10
      max_wait_time: "5m"
      action: "alert-and-scale"
    memory_usage:
      warning_threshold: "80%"
      critical_threshold: "95%"
      action: "alert"
    error_rate:
      window: "5m"
      warning_threshold: "10%"
      critical_threshold: "25%"
      action: "alert-and-pause"
  alerts:
    channels:
      - slack: "#agent-alerts"
      - email: "oncall@company.com"
    rules:
      - name: "Agent Down"
        condition: "heartbeat.failed >= 3"
        severity: critical
        message: "Agent {agent_name} is unresponsive"
      - name: "High Error Rate"
        condition: "error_rate > 25% for 5m"
        severity: high
        message: "Agent {agent_name} error rate at {error_rate}%"
      - name: "Queue Backlog"
        condition: "queue_depth > 10"
        severity: medium
        message: "Agent {agent_name} has {queue_depth} pending tasks"

# Apply health monitoring configuration
agentceo monitoring configure \
  --config health-monitoring.yaml
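To make the thresholds above concrete, here is a minimal Python sketch of how a consecutive-failure heartbeat check and an error-rate classification could behave. It is not agent.ceo's implementation: `HeartbeatTracker` and `classify_error_rate` are hypothetical names, and the comparison rules (`>=` for missed heartbeats, `>` for error rates) are assumptions read off the alert conditions in health-monitoring.yaml.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTracker:
    """Counts consecutive missed heartbeats for one agent (hypothetical helper)."""
    failure_threshold: int = 3  # mirrors failure_threshold in health-monitoring.yaml
    consecutive_failures: int = 0

    def record(self, responded: bool) -> str:
        """Record one heartbeat result and return the action to take."""
        if responded:
            self.consecutive_failures = 0  # any success resets the streak
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return "restart-agent"
        return "ok"


def classify_error_rate(errors: int, total: int,
                        warning: float = 0.10, critical: float = 0.25) -> str:
    """Classify the error rate observed inside one monitoring window."""
    if total == 0:
        return "ok"  # no tasks ran, so there is nothing to judge
    rate = errors / total
    if rate > critical:
        return "critical"
    if rate > warning:
        return "warning"
    return "ok"


tracker = HeartbeatTracker()
# One healthy heartbeat followed by three misses in a row.
print([tracker.record(ok) for ok in (True, False, False, False)])
print(classify_error_rate(errors=3, total=10))
```

Resetting the counter on any successful heartbeat means a single slow response does not trigger a restart; only a sustained outage does.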
Step 3: Track Performance Metrics
Monitor how effectively your agents complete their work:
# performance-metrics.yaml
performance_metrics:
  task_metrics:
    - name: "task_completion_time"
      description: "Time from task assignment to completion"
      aggregations: [p50, p90, p99, avg]
      alert_threshold:
        p90: "10m"  # Alert if 90th percentile exceeds 10 minutes
    - name: "task_success_rate"
      description: "Percentage of tasks completed without error"
      aggregations: [hourly, daily, weekly]
      alert_threshold:
        daily: "95%"  # Alert if success rate drops below 95%
    - name: "human_override_rate"
      description: "Percentage of agent decisions overridden by humans"
      aggregations: [daily, weekly]
      alert_threshold:
        weekly: "15%"  # Alert if override rate exceeds 15%
    - name: "escalation_rate"
      description: "Percentage of tasks escalated to humans"
      aggregations: [daily, weekly]
      target: "< 10%"
  code_review_metrics:
    - name: "review_response_time"
      description: "Time from PR opened to first review comment"
      target: "< 2m"
    - name: "review_accuracy"
      description: "Percentage of review comments accepted by PR author"
      target: "> 80%"
    - name: "issues_caught"
      description: "Bugs/vulnerabilities detected before merge"
      track: [count, severity, category]
# View performance metrics
agentceo metrics show --agent CodeReviewer --period 7d
# Output:
# CodeReviewer - Performance (Last 7 Days)
# =========================================
# Task Completion Time (p50): 45s
# Task Completion Time (p90): 2m 12s
# Task Success Rate: 98.2%
# Human Override Rate: 4.1%
# Review Response Time: 38s
# Issues Caught: 12 (3 high, 6 medium, 3 low)
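If you want to sanity-check the percentile figures yourself, the sketch below computes p50/p90 with the nearest-rank method and applies the 10-minute p90 threshold from performance-metrics.yaml. The aggregation method agent.ceo actually uses is not documented here, so nearest-rank is an assumption, and the sample durations are made up for illustration.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p90_breached(durations_s: list[float], limit_s: float = 600.0) -> bool:
    """True when the 90th percentile exceeds the limit (10 minutes by default)."""
    return percentile(durations_s, 90) > limit_s


def success_rate(completed_ok: int, total: int) -> float:
    """Percentage of tasks completed without error; 100.0 when no tasks ran."""
    return 100.0 * completed_ok / total if total else 100.0


# Made-up task durations in seconds, for illustration only.
durations = [30, 42, 45, 48, 51, 60, 75, 90, 132, 700]
print(percentile(durations, 50), percentile(durations, 90))
print(p90_breached(durations))
```

Note how the single 700-second outlier does not trip the p90 alert. That is the point of alerting on a percentile rather than the maximum.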
Step 4: Set Up Cost Monitoring
Track and control agent spending:
# cost-monitoring.yaml
cost_monitoring:
  tracking:
    granularity: "per-task"  # Track cost per task, per agent, per team
    breakdown:
      - compute_costs  # CPU and memory
      - api_costs      # Claude API usage
      - storage_costs  # Workspace storage
      - network_costs  # Data transfer
  budgets:
    team_daily:
      limit: "$50"
      warning_at: "80%"  # Alert at $40
      action_at: "100%"  # Pause non-critical agents at $50
    agent_daily:
      CodeReviewer: "$15"
      SecurityBot: "$10"
      DevOpsAgent: "$10"
      BackendDev: "$15"
    monthly_total:
      limit: "$1000"
      warning_at: "70%"
      action_at: "90%"
  optimization:
    idle_detection:
      threshold: "15m"
      action: "scale-to-zero"
    batch_processing:
      enabled: true
      batch_window: "5m"
      # Group similar tasks to reduce cold starts
    spot_instances:
      enabled: true
      fallback_to_on_demand: true
      # Use spot for non-urgent tasks
# View cost breakdown
agentceo costs show --period 30d --breakdown agent
# Output:
# Cost Report (Last 30 Days)
# ==========================
# Total: $487.23
#
# Agent         Compute   API       Storage   Total
# ─────────────────────────────────────────────────────
# CodeReviewer  $42.00    $112.45   $2.10     $156.55
# SecurityBot   $38.00    $78.90    $5.20     $122.10
# DevOpsAgent   $35.00    $52.30    $1.80     $89.10
# BackendDev    $40.00    $65.48    $4.00     $109.48
# ReleaseBot    $5.00     $3.50     $1.50     $10.00
#
# Savings from optimization: $82.40 (spot instances, idle detection)
For advanced cost optimization strategies, see our dedicated guide on cost optimization for AI agents.
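The budget thresholds in cost-monitoring.yaml reduce to a simple comparison of spend against a limit. The sketch below shows one way to evaluate them. `budget_state` is a hypothetical helper, and treating the thresholds as inclusive (a warning fires at exactly 80%) is an assumption based on the "Alert at $40" comment.

```python
from decimal import Decimal


def budget_state(spent: Decimal, limit: Decimal,
                 warning_at: Decimal, action_at: Decimal) -> str:
    """Return 'ok', 'warning', or 'action' for spend measured against a budget.

    warning_at and action_at are fractions of the limit (0.80 means 80%).
    """
    used = spent / limit
    if used >= action_at:
        return "action"
    if used >= warning_at:
        return "warning"
    return "ok"


# Team daily budget: $50, warn at 80%, act at 100%.
team_daily = dict(limit=Decimal("50"), warning_at=Decimal("0.80"), action_at=Decimal("1.00"))
# Monthly budget: $1000, warn at 70%, act at 90%.
monthly = dict(limit=Decimal("1000"), warning_at=Decimal("0.70"), action_at=Decimal("0.90"))

print(budget_state(Decimal("12.34"), **team_daily))   # today's spend from the fleet overview
print(budget_state(Decimal("40.00"), **team_daily))   # exactly at the warning line
print(budget_state(Decimal("487.23"), **monthly))     # 30-day total from the cost report
```

`Decimal` is used instead of `float` so that currency comparisons at the exact threshold are not affected by binary rounding.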
Step 5: Create Custom Dashboards
Build dashboards tailored to different audiences:
# dashboards.yaml
dashboards:
  - name: "Engineering Lead"
    widgets:
      - type: "fleet-status"
        position: [0, 0, 6, 2]
      - type: "tasks-completed-today"
        position: [6, 0, 6, 2]
      - type: "cost-burn-rate"
        position: [0, 2, 4, 2]
      - type: "top-issues-caught"
        position: [4, 2, 4, 2]
      - type: "human-override-log"
        position: [8, 2, 4, 2]
  - name: "DevOps"
    widgets:
      - type: "pod-health"
        position: [0, 0, 6, 2]
      - type: "resource-utilization"
        position: [6, 0, 6, 2]
      - type: "error-log"
        position: [0, 2, 12, 3]
  - name: "Executive"
    widgets:
      - type: "monthly-cost-summary"
        position: [0, 0, 4, 2]
      - type: "productivity-impact"
        position: [4, 0, 4, 2]
      - type: "roi-calculator"
        position: [8, 0, 4, 2]

# Create a custom dashboard
agentceo dashboard create "Engineering Lead" \
  --config dashboards.yaml

# Share dashboard with team
agentceo dashboard share "Engineering Lead" \
  --team "Engineering"
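Before applying a dashboard config, it can help to check that widgets do not overlap or run off the grid. The sketch below assumes each `position` is `[x, y, width, height]` on a 12-column grid. That reading fits the values in dashboards.yaml but is not confirmed by the source, and `layout_problems` is a hypothetical helper.

```python
from itertools import combinations

GRID_COLUMNS = 12  # assumption: widgets sit on a 12-column grid

Widget = tuple[str, tuple[int, int, int, int]]  # (type, (x, y, width, height))


def layout_problems(widgets: list[Widget]) -> list[str]:
    """Return human-readable problems found in a dashboard layout."""
    problems = []
    for name, (x, y, w, h) in widgets:
        if x < 0 or y < 0 or w <= 0 or h <= 0:
            problems.append(f"{name}: invalid position")
        elif x + w > GRID_COLUMNS:
            problems.append(f"{name}: extends past column {GRID_COLUMNS}")
    # Two rectangles overlap when they intersect on both axes.
    for (a, (ax, ay, aw, ah)), (b, (bx, by, bw, bh)) in combinations(widgets, 2):
        if ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah:
            problems.append(f"{a} overlaps {b}")
    return problems


ENGINEERING_LEAD: list[Widget] = [
    ("fleet-status", (0, 0, 6, 2)),
    ("tasks-completed-today", (6, 0, 6, 2)),
    ("cost-burn-rate", (0, 2, 4, 2)),
    ("top-issues-caught", (4, 2, 4, 2)),
    ("human-override-log", (8, 2, 4, 2)),
]

print(layout_problems(ENGINEERING_LEAD))
```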
Step 6: Configure Alert Routing
Set up intelligent alerting that avoids noise:
# alert-routing.yaml
alert_routing:
  # Deduplicate similar alerts
  deduplication:
    window: "15m"
    group_by: [agent_name, alert_type]
  # Route alerts based on severity and type
  routes:
    - match:
        severity: critical
      notify:
        - slack: "#agent-alerts"
        - pagerduty: "agent-fleet"
        - sms: "+1234567890"
      repeat_interval: "5m"
    - match:
        severity: high
      notify:
        - slack: "#agent-alerts"
        - email: "engineering-leads@company.com"
      repeat_interval: "30m"
    - match:
        severity: medium
      notify:
        - slack: "#agent-status"
      repeat_interval: "2h"
    - match:
        severity: low
        type: informational
      notify:
        - dashboard-only: true
      repeat_interval: "24h"
  # Quiet hours (suppress non-critical alerts)
  quiet_hours:
    schedule: "22:00-08:00 UTC on weekdays, all day weekends"
    suppress_below: "high"
    buffer_and_send: "08:00 UTC"
# Apply alert routing
agentceo alerts configure --config alert-routing.yaml
# Test alert routing
agentceo alerts test --severity high --message "Test alert"
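The routing rules above combine three ideas: deduplication by `(agent_name, alert_type)`, severity-based channels, and quiet hours. The following sketch models them in plain Python so the interactions are easier to reason about. It is an illustration under assumptions, not the platform's routing engine: the class and function names are hypothetical, and the quiet-hours check follows `suppress_below: "high"`, so only low and medium alerts are held back.

```python
from datetime import datetime, timedelta, timezone

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

# Channels per severity, copied from alert-routing.yaml.
ROUTES = {
    "critical": ["slack:#agent-alerts", "pagerduty:agent-fleet", "sms:+1234567890"],
    "high": ["slack:#agent-alerts", "email:engineering-leads@company.com"],
    "medium": ["slack:#agent-status"],
    "low": ["dashboard-only"],
}


class Deduplicator:
    """Drops repeats of the same (agent_name, alert_type) inside the window."""

    def __init__(self, window: timedelta = timedelta(minutes=15)) -> None:
        self.window = window
        self._last_sent: dict[tuple[str, str], datetime] = {}

    def should_send(self, agent_name: str, alert_type: str, now: datetime) -> bool:
        key = (agent_name, alert_type)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # same alert was sent recently
        self._last_sent[key] = now
        return True


def in_quiet_hours(now: datetime) -> bool:
    """22:00-08:00 UTC on weekdays, all day on weekends. Expects an aware datetime."""
    utc = now.astimezone(timezone.utc)
    if utc.weekday() >= 5:  # Saturday or Sunday
        return True
    return utc.hour >= 22 or utc.hour < 8


def route(severity: str, now: datetime, suppress_below: str = "high") -> list[str]:
    """Return the channels to notify, or an empty list if the alert is held back."""
    if in_quiet_hours(now) and SEVERITY_RANK[severity] < SEVERITY_RANK[suppress_below]:
        return []  # would be buffered until quiet hours end
    return ROUTES[severity]


wednesday_noon = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
dedup = Deduplicator()
print(dedup.should_send("CodeReviewer", "High Error Rate", wednesday_noon))
print(route("medium", wednesday_noon))
```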
Step 7: Set Up Activity Logging
Maintain detailed logs of all agent actions for auditing and debugging:
# logging-config.yaml
logging:
  # What to log
  capture:
    - task-assignments
    - task-completions
    - tool-invocations
    - git-operations
    - api-calls
    - decisions-made
    - errors-encountered
    - escalations
  # Retention
  retention:
    hot_storage: "7 days"    # Fast query, full detail
    warm_storage: "30 days"  # Slower query, full detail
    cold_storage: "1 year"   # Archive, summary only
  # Export integrations
  export:
    datadog:
      enabled: true
      api_key_ref: "secrets/datadog-api-key"
      tags: ["service:agent-ceo", "env:production"]
    elasticsearch:
      enabled: true
      url: "https://elastic.company.com:9200"
      index_prefix: "agent-ceo-logs"

# Query agent activity logs
agentceo logs query \
  --agent CodeReviewer \
  --action "git-operations" \
  --since "24h" \
  --limit 50

# Export logs for analysis
agentceo logs export \
  --format json \
  --since "7d" \
  --output agent-logs-week.json
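The retention settings determine how much detail you can expect when querying older records. As a rough sketch, the function below maps a record's age to a storage tier. It assumes the three periods are measured from the record's creation time (hot until day 7, warm until day 30, cold until one year), which the config does not state explicitly. `storage_tier` is a hypothetical helper.

```python
from datetime import timedelta

# Upper age bound for each tier, assuming ages are measured from record creation.
TIERS = [
    ("hot", timedelta(days=7)),     # fast query, full detail
    ("warm", timedelta(days=30)),   # slower query, full detail
    ("cold", timedelta(days=365)),  # archive, summary only
]


def storage_tier(age: timedelta) -> str:
    """Return the tier a log record of the given age would live in."""
    for name, max_age in TIERS:
        if age <= max_age:
            return name
    return "expired"  # older than the longest retention period


print(storage_tier(timedelta(hours=24)))
print(storage_tier(timedelta(days=20)))
```

In practice this means a `--since "24h"` query hits hot storage, while an investigation reaching back several months can only rely on summaries.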
Step 8: Monitor Agent Collaboration
Track how agents work together within teams:
# View collaboration metrics
agentceo team metrics "Core Engineering" --type collaboration
# Output:
# Team Collaboration Metrics (Last 7 Days)
# =========================================
# Inter-agent messages: 34
# Coordinated tasks: 8
# Handoffs: 12 (success rate: 100%)
# Shared context references: 45
# Conflict resolutions: 2 (both resolved autonomously)
#
# Most active collaboration:
# SecurityBot -> CodeReviewer: 15 interactions
# (Security findings triggering review blocks)
Monitoring Best Practices
- Start with defaults: agent.ceo provides sensible default monitoring. Customize only after you understand baseline behavior.
- Alert on symptoms, not causes: Alert when task completion rate drops, not when CPU hits 80%. The former requires action; the latter might be normal.
- Review weekly: Schedule a 10-minute weekly review of agent metrics. Look for trends, not individual events.
- Correlate with team productivity: Track whether agent deployment correlates with faster PR merge times, fewer production incidents, or reduced toil.
- Set cost budgets early: It is easier to increase budgets than to explain unexpected bills.
- Use dashboards for different audiences: Engineers need technical detail; executives need impact and ROI.
For deeper integration with enterprise monitoring tools, see our guide on real-time agent monitoring. To understand how monitoring fits into the broader platform, explore the agent.ceo architecture.
Integrating with Existing Observability
agent.ceo exports metrics in standard formats compatible with:
- Prometheus/Grafana: Scrape metrics endpoint at /metrics
- Datadog: Native integration via API key
- New Relic: OpenTelemetry export
- Splunk: HEC (HTTP Event Collector) integration
- CloudWatch/Cloud Monitoring: Cloud-native integrations
# Enable Prometheus metrics endpoint
agentceo monitoring enable prometheus \
  --port 9090 \
  --path /metrics

# Enable OpenTelemetry export
agentceo monitoring enable otlp \
  --endpoint "https://otel-collector.company.com:4317"
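Once the Prometheus endpoint is enabled, a scraper reads plain-text samples from it. The sketch below parses a small sample of the Prometheus text exposition format using only the standard library. The metric names (`agent_tasks_completed_total`, `agent_fleet_cost_usd`) are hypothetical, since the source does not list the metrics agent.ceo exposes. The parser is deliberately simplified: it ignores timestamps and does not handle escaped quotes in label values.

```python
import re

# name, optional {labels}, then the sample value
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape output; real metric names may differ.
SAMPLE = """\
# HELP agent_tasks_completed_total Tasks completed per agent
# TYPE agent_tasks_completed_total counter
agent_tasks_completed_total{agent="CodeReviewer"} 23
agent_tasks_completed_total{agent="SecurityBot"} 8
agent_fleet_cost_usd 12.34
"""


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP/TYPE comments
        match = LINE.match(line)
        if not match:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

For production use, an existing Prometheus client or server is the better choice. This parser is only meant to show what the endpoint's output looks like structurally.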
GenBrain AI is the company behind agent.ceo, building the next generation of autonomous agent orchestration.