The Three Pillars for Agent Observability
Monitoring AI agents differs fundamentally from monitoring traditional microservices. A web service is either up or down, responding fast or slow. An AI agent might be running, but stuck in a reasoning loop. It might be active, but working on the wrong task. It might appear healthy by all system metrics while producing incorrect output. Effective agent observability requires layered instrumentation: infrastructure metrics, application-level telemetry, and semantic health indicators that capture whether agents are making meaningful progress.
Traditional observability rests on metrics, logs, and traces. For AI agents, we add a fourth pillar — progress signals — that captures whether an agent is productively advancing toward its goal:
- Metrics — CPU, memory, pod status, request rates
- Logs — Structured event streams from agent execution
- Traces — Distributed tracing across agent interactions
- Progress signals — Task advancement, output quality, goal completion rate
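To make the idea of a progress signal concrete, here is a minimal sketch that compares two consecutive snapshots of an agent's work. The `ProgressSnapshot` shape and the `classifyProgress` function are hypothetical names chosen for illustration, not part of the platform described here. The point is that an agent can keep consuming tokens without completing any new steps, which is what a reasoning loop tends to look like from the outside.

```typescript
// Hypothetical snapshot of an agent's work, sampled periodically.
interface ProgressSnapshot {
  stepsCompleted: number; // plan steps finished so far
  tokensUsed: number;     // cumulative tokens consumed
}

type ProgressSignal = 'advancing' | 'spinning' | 'idle';

// Compare two consecutive snapshots:
// - new steps completed            -> advancing
// - tokens consumed, no new steps  -> spinning (possible reasoning loop)
// - nothing changed                -> idle
function classifyProgress(
  prev: ProgressSnapshot,
  curr: ProgressSnapshot
): ProgressSignal {
  if (curr.stepsCompleted > prev.stepsCompleted) return 'advancing';
  if (curr.tokensUsed > prev.tokensUsed) return 'spinning';
  return 'idle';
}
```

A real signal would likely also weigh output quality, but even this coarse classification separates "busy" from "productive".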
Prometheus Metrics for Agent Workloads
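We instrument every agent pod with a metrics exporter that exposes both system and semantic metrics. A ServiceMonitor tells the Prometheus Operator which pods to scrape and how to label the resulting series: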
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agent-metrics
  namespace: platform-services
  labels:
    app: agent-worker
spec:
  selector:
    matchLabels:
      app: agent-worker
  # matchNames accepts exact namespace names only (no globs), so select
  # all namespaces and keep just the org-* targets with a relabeling rule.
  namespaceSelector:
    any: true
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_namespace]
          regex: org-.*
          action: keep
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: org_namespace
        - sourceLabels: [__meta_kubernetes_pod_label_org]
          targetLabel: org_id
        - sourceLabels: [__meta_kubernetes_pod_label_agent]
          targetLabel: agent_name
The agent metrics exporter runs as a sidecar in each pod:
import { Counter, Gauge, Histogram, Registry, collectDefaultMetrics } from 'prom-client';
import express from 'express';

const registry = new Registry();
collectDefaultMetrics({ register: registry });

// Agent-specific metrics
const taskCompletionTotal = new Counter({
  name: 'agent_tasks_completed_total',
  help: 'Total tasks completed by this agent',
  labelNames: ['status', 'priority'],
  registers: [registry]
});

const taskDurationSeconds = new Histogram({
  name: 'agent_task_duration_seconds',
  help: 'Time to complete a task',
  labelNames: ['task_type', 'complexity'],
  buckets: [30, 60, 120, 300, 600, 1800, 3600],
  registers: [registry]
});

const agentState = new Gauge({
  name: 'agent_state',
  help: 'Current agent state (1=active, 0.5=thinking, 0=idle)',
  registers: [registry]
});

const toolCallsTotal = new Counter({
  name: 'agent_tool_calls_total',
  help: 'Total tool invocations',
  labelNames: ['tool_name', 'result'],
  registers: [registry]
});

const tokenUsageTotal = new Counter({
  name: 'agent_token_usage_total',
  help: 'Total tokens consumed',
  labelNames: ['direction'], // input, output
  registers: [registry]
});

const idleSeconds = new Gauge({
  name: 'agent_idle_seconds',
  help: 'Seconds since last meaningful activity',
  registers: [registry]
});

// Expose metrics endpoint
const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});
app.listen(9090);
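The exporter declares the `agent_idle_seconds` gauge but does not show how its value is derived. One possible approach, sketched below under assumptions, is a small tracker that records the time of the last meaningful activity and reports the elapsed seconds. `IdleTracker` is a hypothetical helper, and the injectable clock exists only to make the logic easy to test.

```typescript
// Hypothetical helper: tracks time since the last meaningful activity.
class IdleTracker {
  private readonly now: () => number;
  private lastActivityMs: number;

  // The clock is injectable so tests can control time; it defaults to wall time.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.lastActivityMs = now();
  }

  // Call whenever the agent does something meaningful
  // (for example a completed tool call or a finished plan step).
  markActivity(): void {
    this.lastActivityMs = this.now();
  }

  // Seconds elapsed since the last recorded activity, never negative.
  idleSeconds(): number {
    return Math.max(0, (this.now() - this.lastActivityMs) / 1000);
  }
}
```

In the exporter, the gauge could then be refreshed on a timer, for instance by calling `idleSeconds.set(tracker.idleSeconds())` every few seconds. What counts as "meaningful activity" is a design decision: counting every token generated would hide reasoning loops, so tying it to completed steps or successful tool calls is probably the safer choice.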
Key Monitoring Dashboards
We build Grafana dashboards that give both platform operators and customers visibility into their agents. Here are the essential PromQL queries:
# Fleet-wide agent health score
# Percentage of running agents that are making progress (state > 0)
count(agent_state > 0) / count(agent_state) * 100

# Task completion rate (tasks per hour per agent)
sum by (agent_name) (
  rate(agent_tasks_completed_total{status="success"}[1h])
) * 3600

# 95th-percentile task duration by type
histogram_quantile(0.95,
  sum by (le, task_type) (
    rate(agent_task_duration_seconds_bucket[5m])
  )
)

# Token burn rate (cost proxy) per organization, in tokens per 5 minutes
sum by (org_id) (
  rate(agent_token_usage_total[5m])
) * 300

# Stuck agent detection: agents idle for more than 10 minutes
# while having assigned tasks
agent_idle_seconds > 600
  and on(agent_name, org_id) agent_assigned_tasks > 0

# Resource utilization efficiency: CPU cores used per core requested.
# cAdvisor and kube-state-metrics series carry the standard namespace label.
sum by (namespace) (
  rate(container_cpu_usage_seconds_total{container="claude-agent"}[5m])
) / sum by (namespace) (
  kube_pod_container_resource_requests{container="claude-agent", resource="cpu"}
)

# Error rate by tool (percent)
sum by (tool_name) (
  rate(agent_tool_calls_total{result="error"}[5m])
) / sum by (tool_name) (
  rate(agent_tool_calls_total[5m])
) * 100
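The fleet health score is easiest to reason about as "agents with a non-zero state, divided by all agents". The sketch below restates that arithmetic in TypeScript, using the state encoding from the `agent_state` gauge (1 = active, 0.5 = thinking, 0 = idle). The function name `fleetHealthPercent` is an illustrative assumption. Note that it counts agents rather than adding up their state values, so a thinking agent counts as one full agent.

```typescript
// States use the agent_state encoding: 1 = active, 0.5 = thinking, 0 = idle.
function fleetHealthPercent(states: number[]): number {
  if (states.length === 0) return 0; // no agents running: report 0 rather than NaN
  const progressing = states.filter((s) => s > 0).length;
  return (progressing / states.length) * 100;
}

// One active, one thinking, two idle agents.
const score = fleetHealthPercent([1, 0.5, 0, 0]);
console.log(score); // prints 50
```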
Structured Logging Pipeline
Agent logs are structured JSON, shipped through Fluent Bit to Cloud Logging with agent context attached:
# Fluent Bit ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: platform-services
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         info
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Tag               agent.*
        Path              /var/log/containers/agent-*.log
        Parser            docker
        Refresh_Interval  10
        Mem_Buf_Limit     5MB

    [FILTER]
        Name              kubernetes
        Match             agent.*
        Kube_Tag_Prefix   agent.var.log.containers.
        Merge_Log         On
        K8S-Logging.Parser On

    [FILTER]
        Name              modify
        Match             agent.*
        Add               platform agent-ceo
        Add               environment production

    [OUTPUT]
        Name              stackdriver
        Match             agent.*
        Resource          k8s_container
        k8s_cluster_name  agent-ceo-prod
        k8s_cluster_location us-central1
Agent code emits structured log events that capture semantic context:
import { createLogger, format, transports } from 'winston';

const logger = createLogger({
  format: format.combine(
    format.timestamp(),
    format.json()
  ),
  defaultMeta: {
    agentId: process.env.AGENT_ID,
    orgId: process.env.ORG_ID,
    service: 'agent-worker'
  },
  transports: [new transports.Console()]
});

// Structured agent activity logging.
// Task is assumed to expose id, progress, and startedAt (epoch milliseconds).
function logTaskProgress(task: Task, phase: string, details: Record<string, any>) {
  logger.info('task_progress', {
    taskId: task.id,
    phase, // planning, executing, reviewing, complete
    progress: task.progress,
    toolsUsed: details.tools || [],
    filesModified: details.files || [],
    tokensUsed: details.tokens || 0,
    elapsedSeconds: (Date.now() - task.startedAt) / 1000
  });
}

// Tool call logging with latency
function logToolCall(tool: string, duration: number, success: boolean, error?: string) {
  logger.info('tool_call', {
    tool,
    durationMs: duration,
    success,
    error: error || undefined
  });

  // Update Prometheus metrics (toolCallsTotal is the counter defined in the exporter)
  toolCallsTotal.inc({ tool_name: tool, result: success ? 'success' : 'error' });
}
Alerting Rules
Alerting for agents requires understanding the difference between "unhealthy" and "unproductive." A pod crash is an infrastructure alert. An agent spending 30 minutes on a task that should take 5 is a semantic alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-alerts
  namespace: platform-services
spec:
  groups:
    - name: agent-health
      interval: 30s
      rules:
        # Infrastructure alerts
        - alert: AgentPodCrashLooping
          # More than 3 container restarts within 15 minutes
          expr: |
            increase(kube_pod_container_status_restarts_total{
              container="claude-agent"
            }[15m]) > 3
          for: 5m
          labels:
            severity: critical
            category: infrastructure
          annotations:
            summary: "Agent {{ $labels.pod }} is crash looping"
            runbook: "Check pod logs and resource limits"

        - alert: AgentHighMemoryUsage
          expr: |
            container_memory_usage_bytes{container="claude-agent"}
              / container_spec_memory_limit_bytes{container="claude-agent"}
              > 0.9
          for: 5m
          labels:
            severity: warning
            category: infrastructure
          annotations:
            summary: "Agent {{ $labels.pod }} memory usage above 90%"

        # Semantic alerts: agent behavior anomalies
        - alert: AgentStuck
          expr: |
            agent_idle_seconds > 900
              and on(pod) kube_pod_status_phase{phase="Running"} == 1
          for: 5m
          labels:
            severity: warning
            category: semantic
          annotations:
            summary: "Agent {{ $labels.agent_name }} appears stuck (15min idle)"
            action: "Check if agent is in a reasoning loop or waiting for input"

        - alert: AgentHighErrorRate
          expr: |
            sum by (agent_name, org_id) (
              rate(agent_tool_calls_total{result="error"}[10m])
            ) / sum by (agent_name, org_id) (
              rate(agent_tool_calls_total[10m])
            ) > 0.3
          for: 5m
          labels:
            severity: warning
            category: semantic
          annotations:
            summary: "Agent {{ $labels.agent_name }} has >30% tool error rate"

        - alert: AgentExcessiveTokenBurn
          expr: |
            sum by (agent_name, org_id) (
              rate(agent_token_usage_total[5m])
            ) > 10000
          for: 10m
          labels:
            severity: warning
            category: cost
          annotations:
            summary: "Agent {{ $labels.agent_name }} burning tokens at unusual rate"

        # Tenant-level alerts
        - alert: OrgAgentQuotaNearLimit
          # Count running agents through their exported agent_state series;
          # kube_pod_status_phase carries neither a container nor an org_id label.
          expr: |
            count by (org_id) (agent_state)
              / on(org_id) group_left() agent_org_quota > 0.9
          for: 1m
          labels:
            severity: info
            category: capacity
          annotations:
            summary: "Organization {{ $labels.org_id }} approaching agent quota"
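The distinction between "unhealthy" and "unproductive" can also be expressed as a small decision function. The sketch below is illustrative only: the `AgentObservation` shape, the `expectedSeconds` estimate, and the default slowdown factor of 3 are all assumptions, not values taken from the platform.

```typescript
type AlertCategory = 'infrastructure' | 'semantic' | 'none';

// Hypothetical observation of a single agent at one point in time.
interface AgentObservation {
  podRunning: boolean;     // infrastructure view: is the pod up?
  elapsedSeconds: number;  // time spent on the current task so far
  expectedSeconds: number; // assumed estimate for this task type
}

// A pod that is not running is an infrastructure problem.
// A running agent that is far over its expected duration is a semantic problem.
function categorize(obs: AgentObservation, slowdownFactor = 3): AlertCategory {
  if (!obs.podRunning) return 'infrastructure';
  if (obs.elapsedSeconds > obs.expectedSeconds * slowdownFactor) return 'semantic';
  return 'none';
}

// An agent 30 minutes into a task that should take 5 minutes.
console.log(categorize({ podRunning: true, elapsedSeconds: 1800, expectedSeconds: 300 }));
// prints "semantic"
```

Checking infrastructure first keeps the two categories from overlapping: a crashed pod never also raises a semantic alert.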
Distributed Tracing Across Agent Interactions
When agents delegate tasks to other agents, we need distributed tracing to follow the request chain. We use OpenTelemetry with NATS propagation:
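import { trace, context, propagation, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('agent-worker');

// nc (NATS connection), sc (string codec), and orgId are assumed to be
// initialized elsewhere in the agent runtime.
async function delegateToAgent(targetAgent: string, task: TaskPayload) {
  const span = tracer.startSpan('delegate_task', {
    kind: SpanKind.PRODUCER,
    attributes: {
      'agent.target': targetAgent,
      'task.type': task.type,
      'task.priority': task.priority
    }
  });

  try {
    // Inject the new span's context (not just the ambient one) so the
    // receiving agent's spans become children of delegate_task.
    const headers: Record<string, string> = {};
    propagation.inject(trace.setSpan(context.active(), span), headers);

    // The trace context travels inside the message payload as traceHeaders
    nc.publish(`org.${orgId}.tasks.${targetAgent}.inbox`, sc.encode(JSON.stringify({
      ...task,
      traceHeaders: headers
    })));
  } finally {
    span.end();
  }
}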
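On the receiving side, the agent needs to read the trace context back out of `traceHeaders`. In practice this would typically go through `propagation.extract` from `@opentelemetry/api`. The dependency-free sketch below only illustrates what travels in the message, assuming the default W3C Trace Context propagator is configured so that the carrier contains a `traceparent` entry of the form `version-traceid-spanid-flags`. The `parseTraceParent` helper is hypothetical and handles only that fixed four-field layout.

```typescript
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the parent span)
  sampled: boolean; // lowest bit of the trace flags
}

// Four-field W3C traceparent layout: version-traceid-spanid-flags
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null when the header is missing or malformed.
function parseTraceParent(headers: Record<string, string>): TraceParent | null {
  const raw = headers['traceparent'];
  if (raw === undefined) return null;

  const match = TRACEPARENT_RE.exec(raw.trim());
  if (match === null) return null;

  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the W3C specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;

  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Attaching the parsed `traceId` to each structured log event is one way to correlate an agent's logs with the trace of the delegation that triggered them.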
This observability stack connects directly to our cost optimization engine, which uses these metrics to identify idle agents for scale-to-zero. For customers, the "monitoring your AI agent fleet" tutorial provides a step-by-step setup guide.
For teams exploring self-healing infrastructure, these monitoring signals feed into automated remediation — restarting stuck agents, scaling capacity during burst periods, and alerting operators when semantic anomalies indicate deeper issues.
Continue reading: Explore the architecture behind agent.ceo, learn about scaling AI agents to 100 concurrent workers, or get started with our 5-minute quickstart guide.
agent.ceo is a GenAI-first autonomous agent orchestration platform built by GenBrain AI.
Try agent.ceo
SaaS — Get started with 1 free agent-week at agent.ceo.
Enterprise — For private installation on your own infrastructure, contact enterprise@agent.ceo.
General inquiries: hello@agent.ceo | Security: security@agent.ceo