You would never run a production service without metrics, logs, and alerts. But most teams deploying AI agents treat them like magic boxes -- fire a task, hope it completes, check the output manually. In a Cyborgenic Organization, where AI agents hold real operational roles, that falls apart on day one.
At GenBrain AI, we run six agents 24/7 through agent.ceo. This tutorial walks through the observability stack we built to keep that fleet healthy, accountable, and cost-efficient.
What You Need to Measure (and Why)
Agent observability has four layers. Skip any one of them and you will have blind spots that cost you money, quality, or both.
Layer 1: Health and Availability
The baseline: is the agent running, responsive, and able to accept work?
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| agent_heartbeat_age_seconds | Time since last heartbeat | > 120s |
| agent_process_status | Running / crashed / OOM | != running |
| agent_restart_count_total | Crash frequency | > 3 in 1 hour |
| agent_mcp_connection_status | Tool server connectivity | != connected |
These catch the obvious failures: crashes, OOM kills, lost MCP connections. In month one, we were losing 2.1% of uptime to OOM kills. Once we had visibility into restart counts, we added memory limits and dropped downtime to 0.3%.
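To make this concrete, here is a minimal sketch of how the Layer 1 metrics could be exposed with Python's prometheus_client library. The agent_role label value and the record_health helper are illustrative assumptions, not the agent.ceo implementation. Note that agent_heartbeat_age_seconds is most naturally derived in PromQL as time() minus a heartbeat timestamp gauge, which is exactly how the AgentDown alert later in this post is written.

```python
# Illustrative sketch only -- metric names follow the table above;
# label values and helper names are assumptions.
import time
from prometheus_client import Gauge, Counter

AGENT_ROLE = "cto"  # assumed label value for illustration

heartbeat_ts = Gauge(
    "agent_heartbeat_timestamp", "Unix time of the last heartbeat", ["agent_role"]
)
process_status = Gauge(
    "agent_process_status", "1 = running, 0 = crashed/OOM", ["agent_role"]
)
restart_count = Counter(
    "agent_restart_count_total", "Agent process restarts", ["agent_role"]
)
mcp_connected = Gauge(
    "agent_mcp_connection_status", "1 = connected to the MCP tool server", ["agent_role"]
)

def record_health(running: bool, mcp_ok: bool) -> None:
    """Called on every heartbeat tick (e.g. every 30 seconds)."""
    heartbeat_ts.labels(agent_role=AGENT_ROLE).set(time.time())
    process_status.labels(agent_role=AGENT_ROLE).set(1 if running else 0)
    mcp_connected.labels(agent_role=AGENT_ROLE).set(1 if mcp_ok else 0)
```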
Layer 2: Task Performance
Health tells you the agent is alive. Performance tells you it is doing good work.
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| agent_task_completion_rate | % of tasks completed successfully | < 90% over 1h |
| agent_task_duration_seconds | How long tasks take | > 2x p95 baseline |
| agent_task_retries_total | First-pass failure frequency | > 20% retry rate |
| agent_verification_pass_rate | Quality of completed work | < 85% |
| agent_sla_compliance_ratio | SLA adherence | < 97% |
The agent_verification_pass_rate metric -- the percentage of tasks that pass automated verification on the first attempt -- is the closest equivalent to an error rate for knowledge work. At GenBrain AI, our fleet-wide first-pass quality sits at 87.2%. We target 90%. The gap is visible because we measure it.
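The rates in this table are typically computed from underlying counters rather than exported directly. A hedged sketch of that instrumentation with prometheus_client follows; the run_task wrapper, task.execute(), result.verified, and the agent_task_failure_total name are placeholders for illustration, not agent.ceo APIs.

```python
# Illustrative sketch -- counters and histogram from which the Layer 2
# rates can be derived (e.g. via Prometheus recording rules).
import time
from prometheus_client import Counter, Histogram

tasks_completed = Counter(
    "agent_task_completion_total", "Tasks completed successfully", ["agent_role"]
)
tasks_failed = Counter(
    "agent_task_failure_total", "Tasks that failed permanently", ["agent_role"]  # assumed name
)
task_retries = Counter(
    "agent_task_retries_total", "Task retry attempts", ["agent_role"]
)
task_duration = Histogram(
    "agent_task_duration_seconds", "Task wall-clock duration", ["agent_role"]
)
verification_total = Counter(
    "agent_verification_total", "Automated verification outcomes",
    ["agent_role", "result"],   # result: pass | fail
)

def run_task(agent_role: str, task, max_attempts: int = 3) -> None:
    """Placeholder wrapper that records Layer 2 metrics around task execution."""
    start = time.time()
    try:
        for attempt in range(1, max_attempts + 1):
            try:
                result = task.execute()   # placeholder for however the agent runs work
                break
            except Exception:
                if attempt == max_attempts:
                    tasks_failed.labels(agent_role).inc()
                    raise
                task_retries.labels(agent_role).inc()
        tasks_completed.labels(agent_role).inc()
        verification_total.labels(agent_role, "pass" if result.verified else "fail").inc()
    finally:
        task_duration.labels(agent_role).observe(time.time() - start)
```

Completion rate and verification pass rate then fall out of these counters via recording rules or dashboard queries over whatever window you care about.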
Layer 3: Resource Consumption and Cost
AI agents consume three expensive resources: LLM tokens, compute time, and context window capacity. If you are not tracking these, you are guessing at your unit economics.
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| agent_tokens_consumed_total | LLM API usage | > 2x daily baseline |
| agent_cost_dollars_total | Dollar cost per agent per day | > $20/day |
| agent_context_window_usage_ratio | How full the context window is | > 85% |
| agent_compaction_count_total | Context compaction frequency | > 5 per task |
| agent_cost_per_task_dollars | Unit economics | > $1.00 |
Context window usage is the metric most teams miss. When an agent's context fills up, it compacts -- summarizing earlier context to make room. Compaction is lossy. An agent that compacts 8 times during a single task is more likely to hallucinate or lose track of its objective. We alert at 85% because that is where we have empirically observed quality degradation.
Our current cost per task is $0.37 across the fleet. Without this metric, we would have no idea whether our cost optimizations were actually working.
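For a rough idea of the bookkeeping involved, here is a sketch that assumes per-call token counts are available from the LLM response and uses made-up example prices; substitute your provider's actual rates. The record_llm_call helper and the price table are assumptions, not part of agent.ceo.

```python
# Illustrative sketch -- token and cost accounting per LLM call.
from prometheus_client import Counter, Gauge

PRICE_PER_1K_TOKENS = {"input": 0.003, "output": 0.015}  # assumed example rates (USD)

tokens_consumed = Counter(
    "agent_tokens_consumed_total", "LLM tokens consumed", ["agent_role", "direction"]
)
cost_dollars = Counter(
    "agent_cost_dollars_total", "Accumulated LLM spend in USD", ["agent_role"]
)
context_usage = Gauge(
    "agent_context_window_usage_ratio", "Fraction of the context window in use", ["agent_role"]
)
compactions = Counter(
    "agent_compaction_count_total", "Context compactions performed", ["agent_role"]
)

def record_llm_call(agent_role: str, input_tokens: int, output_tokens: int) -> None:
    """Called after every LLM API response."""
    tokens_consumed.labels(agent_role, "input").inc(input_tokens)
    tokens_consumed.labels(agent_role, "output").inc(output_tokens)
    cost = (input_tokens * PRICE_PER_1K_TOKENS["input"]
            + output_tokens * PRICE_PER_1K_TOKENS["output"]) / 1000
    cost_dollars.labels(agent_role).inc(cost)
```

agent_cost_per_task_dollars can then be a recording rule or dashboard query that divides cost by completed tasks over the same window.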
Layer 4: Organizational Health
Individual agent metrics are necessary but insufficient. You also need fleet-wide visibility into how the agents are working together.
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| agent_inbox_depth | Backlog of unprocessed tasks | > 10 messages |
| agent_blocked_task_count | Tasks waiting on dependencies | > 3 per agent |
| agent_meeting_duration_seconds | Agent coordination overhead | > 600s |
| agent_escalation_rate | How often agents escalate to humans | > 15% |
| agent_cross_agent_message_rate | Inter-agent communication volume | Anomaly detection |
Inbox depth is the canary metric. When it grows faster than the agent can process, either the agent is stuck or the tasks are too complex. We have seen inbox depth spikes predict agent failures 15-20 minutes before any other metric moves.
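A sketch of the fleet-level instrumentation follows. The inbox object, the sampling helper, and the *_total counter names (from which the escalation and message rates above would be derived) are assumptions for illustration.

```python
# Illustrative sketch -- Layer 4 coordination metrics.
from prometheus_client import Gauge, Counter, Histogram

inbox_depth = Gauge("agent_inbox_depth", "Unprocessed messages in the agent inbox", ["agent_role"])
blocked_tasks = Gauge("agent_blocked_task_count", "Tasks blocked on dependencies", ["agent_role"])
meeting_duration = Histogram("agent_meeting_duration_seconds", "Agent coordination meeting length", ["agent_role"])
escalations = Counter("agent_escalation_total", "Escalations to a human", ["agent_role"])          # assumed name
cross_messages = Counter("agent_cross_agent_message_total", "Messages sent to other agents",       # assumed name
                         ["agent_role", "peer_role"])

def sample_org_health(agent_role: str, inbox, blocked: int) -> None:
    """Called periodically, e.g. from the same loop that emits heartbeats."""
    inbox_depth.labels(agent_role).set(len(inbox))
    blocked_tasks.labels(agent_role).set(blocked)

def record_escalation(agent_role: str) -> None:
    escalations.labels(agent_role).inc()
```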
The Stack: Prometheus + Grafana + NATS
Our observability stack uses three components, all open-source.
Prometheus: Metric Collection
Each agent exposes a /metrics endpoint via a lightweight HTTP sidecar. Prometheus scrapes these endpoints every 15 seconds.
```yaml
# prometheus.yml - agent fleet scrape config
scrape_configs:
  - job_name: 'agent-fleet'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'ceo-agent:9090'
          - 'cto-agent:9090'
          - 'devops-agent:9090'
          - 'security-agent:9090'
          - 'marketing-agent:9090'
          - 'fullstack-agent:9090'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+)-agent:.+'
        target_label: agent_role
        replacement: '${1}'
```
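On the agent side, the sidecar can be very small. Here is a minimal, self-contained sketch that serves /metrics on port 9090 (matching the scrape targets above) and refreshes a heartbeat gauge every 30 seconds; check_agent() is a stand-in for however you actually probe the agent process, and the gauge repeats the one from the Layer 1 sketch to keep this runnable on its own.

```python
# Illustrative sidecar sketch -- serve /metrics and update a heartbeat.
import time
from prometheus_client import Gauge, start_http_server

heartbeat_ts = Gauge("agent_heartbeat_timestamp", "Unix time of the last heartbeat", ["agent_role"])

def check_agent() -> bool:
    """Stand-in for a real probe (PID check, local HTTP ping, ...)."""
    return True

def main() -> None:
    start_http_server(9090)          # serves GET /metrics on :9090
    while True:
        if check_agent():
            heartbeat_ts.labels(agent_role="cto").set(time.time())
        time.sleep(30)               # heartbeat cadence, matching the NATS pings below

if __name__ == "__main__":
    main()
```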
NATS: Event Bus for Real-Time Metrics
Beyond raw metrics, you need events: task started, task completed, task failed, agent restarted, SLA breached. These flow through NATS JetStream, giving us ordered, persistent, replayable event streams.
```
# NATS subjects for agent observability
agent.*.heartbeat        # Health pings (every 30s)
agent.*.task.started     # Task lifecycle events
agent.*.task.completed
agent.*.task.failed
agent.*.sla.breach       # SLA violation alerts
agent.*.escalation       # Human escalation events
agent.*.crash            # Crash and restart events
```
Wildcard patterns let us subscribe to all agents or filter to a specific one. A metrics bridge subscribes to agent.> and maintains Prometheus counters. A separate alerting service subscribes to agent.*.sla.breach and agent.*.crash for immediate notification.
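Here is a hedged sketch of what such a metrics bridge could look like with the nats-py client; the durable consumer name, the port, and the event-to-counter mapping are assumptions, not the agent.ceo implementation.

```python
# Illustrative metrics bridge -- turn NATS agent events into Prometheus counters.
import asyncio
import nats
from prometheus_client import Counter, start_http_server

events = Counter("agent_events_total", "Agent events seen on NATS", ["agent_role", "event"])

async def main() -> None:
    start_http_server(9091)  # the bridge exposes its own /metrics endpoint
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    async def handle(msg):
        # Subjects look like agent.<role>.<event...>, e.g. agent.cto.task.completed
        _, role, *event = msg.subject.split(".")
        events.labels(role, ".".join(event)).inc()
        await msg.ack()

    await js.subscribe("agent.>", durable="metrics-bridge", cb=handle, manual_ack=True)
    await asyncio.Event().wait()  # keep the bridge running

if __name__ == "__main__":
    asyncio.run(main())
```

The alerting service would follow the same pattern with a narrower subscription (agent.*.sla.breach and agent.*.crash) and a notifier call instead of a counter increment.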
Grafana: Dashboards
We run four dashboards:
Fleet Overview. One row per agent showing status, current task, completion rate (24h), SLA compliance (7d), cost (today), and context usage. The "is everything okay?" dashboard.
Agent Detail. Per-agent deep dive: task timeline, token consumption, context window usage, retry rate trend, and the last 10 completed tasks with verification status.
Cost Dashboard. Daily and monthly cost by agent, cost per task trend, token consumption by model, and projected spend. This is how we caught our CTO agent consuming 40% more tokens than the others.
SLA Dashboard. Compliance by agent, task type, and time period. Breach log with root cause classification. Feeds directly into our quarterly report cards.
Alerting Rules That Actually Matter
Alert fatigue is real. Here are the seven alerts we run in production, each chosen because ignoring it caused a real problem.
```yaml
# Grafana alert rules (simplified)
groups:
  - name: agent-critical
    rules:
      - alert: AgentDown
        expr: time() - agent_heartbeat_timestamp > 120
        for: 2m
        labels:
          severity: critical
      - alert: AgentCrashLoop
        expr: increase(agent_restart_count_total[1h]) > 3
        for: 0m
        labels:
          severity: critical
      - alert: SLABreach
        expr: agent_sla_compliance_ratio < 0.95
        for: 15m
        labels:
          severity: warning
      - alert: ContextWindowCritical
        expr: agent_context_window_usage_ratio > 0.85
        for: 5m
        labels:
          severity: warning
      - alert: CostAnomaly
        expr: agent_cost_dollars_total > 20
        for: 0m
        labels:
          severity: warning
      - alert: InboxBacklog
        expr: agent_inbox_depth > 10
        for: 10m
        labels:
          severity: warning
      - alert: HighRetryRate
        expr: agent_task_retries_total / agent_task_completion_total > 0.2
        for: 30m
        labels:
          severity: warning
```
Seven alerts. Not seventy. Each one maps to a specific remediation action. AgentDown means restart the agent. ContextWindowCritical means the current task should be broken into subtasks. CostAnomaly means review what the agent is working on -- it might be stuck in a retry loop burning tokens.
Practical Setup: From Zero to Observable in 30 Minutes
If you deploy through agent.ceo, the SaaS platform includes built-in observability and a monitoring dashboard out of the box.
For self-hosted deployments, five steps:
- Deploy the metrics sidecar -- each agent container gets a lightweight sidecar exposing /metrics (agentceo deploy --with-metrics --agent cto)
- Configure Prometheus scraping -- point at your agent fleet endpoints using the config above (use ServiceMonitor CRDs on Kubernetes)
- Import Grafana dashboards -- we publish dashboard JSON templates in the agent.ceo docs
- Configure alert routing -- critical alerts to Slack/PagerDuty, warnings to daily digest
- Set SLA baselines -- run one week without alerts, then enable SLA enforcement with realistic baselines
Lessons from Six Months of Agent Observability
Context window usage is your leading indicator. When context fills up, everything degrades: quality, retries, task duration. Watch this metric first.
Cost per task beats total cost. Ours dropped from $0.52 to $0.37 over six months -- a 29% improvement from prompt optimization and skill transfer.
Inbox depth predicts failures. A growing inbox means work arrives faster than it is processed. This signal leads agent failures by 15-20 minutes.
Seven alerts is the right number. We started with 23, silenced most within a week. The survivors map to real problems with clear remediation steps. Everything else is dashboard material.
Try agent.ceo
GenBrain AI built this observability stack because running a Cyborgenic Organization without visibility is flying blind. With agent.ceo, you get production-grade observability out of the box -- fleet dashboards, SLA tracking, cost monitoring, and intelligent alerting.
SaaS: Deploy your first observable agent fleet at agent.ceo. Monitoring included, no configuration required.
Enterprise: Need custom metrics, on-premise Grafana integration, or compliance-grade audit logging? Contact enterprise@agent.ceo.