Technical · 7 min read

Kubernetes for AI Agents: What Platform Engineering Teams Need to Know

Moshe Beeri, Founder
kubernetes · platform-engineering · ai-agents · devops · infrastructure · observability · sre
graph TB
    subgraph "Agent Fleet on Kubernetes"
        NS["Namespace: org-acme"]
        POD1["Pod: cto-agent<br/>2Gi mem, 1 CPU"]
        POD2["Pod: devops-agent<br/>2Gi mem, 1 CPU"]
        POD3["Pod: fullstack-agent<br/>4Gi mem, 2 CPU"]
        PV1["PVC: workspace"]
        PV2["PVC: workspace"]
        PV3["PVC: workspace"]
    end

    NS --> POD1 & POD2 & POD3
    POD1 --- PV1
    POD2 --- PV2
    POD3 --- PV3

CNCF reports Kubernetes production usage at 82% among container users, and 66% of AI adopters use Kubernetes to scale inference workloads. If your team runs infrastructure on Kubernetes, your AI agents should run there too — not in a separate SaaS silo disconnected from your existing observability, security, and deployment tooling.

This post is for platform engineering teams, DevOps leads, and SREs evaluating how to run AI agent fleets in production. It covers the operational patterns agent.ceo uses to deploy and manage agents on Kubernetes, the problems that are unique to AI workloads, and the solutions we built after running our own 6-agent fleet in production for over six months.

Why AI Agents Are Different from Traditional Workloads

Platform engineers who have spent years managing stateless microservices will find AI agents unfamiliar. Three characteristics make them operationally distinct.

Memory grows monotonically. A traditional web server allocates connection pools at startup and reaches a steady state. AI agents accumulate context with every tool call, file read, and conversation turn. A fresh agent sits at 400Mi. After two hours of active work, it can push past 2Gi. This is not a memory leak — it is the fundamental nature of context-window-based processing.

Sessions are long-running and stateful. An agent might work on a task for 30 minutes to 2 hours. If the pod is evicted, rescheduled, or OOM-killed during that window, the agent loses its entire working context. This is catastrophically different from killing a stateless web server that restarts in seconds.

Failure modes are subtle. A crashed web server stops responding to health checks. A stuck AI agent continues responding to health checks while producing nothing useful — looping on a failed approach, waiting for a tool that will never respond, or generating content that ignores its instructions due to context corruption.

The agent.ceo Kubernetes Architecture

agent.ceo deploys each agent as a dedicated Kubernetes pod within an organization namespace. This is not a shared-process model where multiple agents compete for resources in a single container. Each agent gets isolation, dedicated resources, and independent lifecycle management.

Namespace Isolation

Every organization gets its own Kubernetes namespace with network policies, resource quotas, and RBAC boundaries. Agents from different organizations cannot communicate, cannot access each other's storage, and cannot see each other's pods.
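
A minimal sketch of what that isolation looks like in manifests. The names and quota values here are illustrative, not agent.ceo's actual configuration:

apiVersion: v1
kind: Namespace
metadata:
  name: org-acme
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: org-quota
  namespace: org-acme
spec:
  hard:
    requests.cpu: "8"          # illustrative ceiling for the whole fleet
    requests.memory: 16Gi
    persistentvolumeclaims: "10"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-org
  namespace: org-acme
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # allow traffic only from the same namespace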

Pod Specification

Each agent pod includes the agent runtime (Claude Code CLI), persistent workspace storage via PVC, Git credentials for repository access, MCP server configurations for tool access, and NATS credentials for messaging. Resource requests and limits are tuned per agent role — a backend agent doing code generation needs more memory than a marketing agent writing content.
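
As a sketch, a pod spec along these lines captures the shape; the image, secret, and claim names are placeholders, not the actual agent.ceo runtime:

apiVersion: v1
kind: Pod
metadata:
  name: fullstack-agent
  namespace: org-acme
spec:
  containers:
    - name: agent
      image: registry.example.com/agent-runtime:latest   # placeholder image
      resources:
        requests:
          memory: 2Gi
          cpu: "1"
        limits:
          memory: 4Gi          # tuned per agent role
          cpu: "2"
      volumeMounts:
        - name: workspace
          mountPath: /workspace
        - name: nats-creds
          mountPath: /etc/nats
          readOnly: true
  volumes:
    - name: workspace
      persistentVolumeClaim:
        claimName: fullstack-agent-workspace
    - name: nats-creds
      secret:
        secretName: fullstack-agent-nats    # placeholder secret name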

Persistent Volume Claims

Agent workspaces live on persistent volumes that survive pod restarts. Git repositories, session state, configuration files, and work-in-progress artifacts persist across agent lifecycles. When an agent restarts, it mounts the same workspace and resumes where it left off.
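
As Kubernetes objects these are plain PVCs; a hypothetical example, with the size as an assumption rather than a platform default:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fullstack-agent-workspace
  namespace: org-acme
spec:
  accessModes:
    - ReadWriteOnce            # one agent pod mounts each workspace
  resources:
    requests:
      storage: 20Gi            # assumption; size to the repo and artifacts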

Solving the OOM Problem

The Linux OOM-killer is the single biggest operational threat to AI agents on Kubernetes. When an agent's context window pushes memory usage past the pod's limit, the kernel sends SIGKILL — no graceful shutdown, no state preservation, no callback.

agent.ceo solves this with a cgroup-aware memory governor that runs inside each pod. The governor monitors memory pressure through /sys/fs/cgroup/memory.current and escalates through three tiers before the kernel intervenes:

Tier 1 (70% memory): Compact. Trigger context compaction to reduce the working set. The agent summarizes its conversation history, freeing memory while preserving essential context.

Tier 2 (85% memory): Clear. Aggressively clear cached tool outputs and non-essential context. The agent retains its task state and critical decisions but drops detailed histories.

Tier 3 (95% memory): Archive and terminate. Write the current session state to persistent storage, mark the task as interrupted with a recovery checkpoint, and terminate gracefully. The next session can resume from the checkpoint.

At every tier, this approach preserves more state than a kernel OOM-kill, which preserves nothing.
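
A minimal Python sketch of the governor loop, assuming cgroup v2. The compact_context, clear_caches, and checkpoint_and_exit hooks are illustrative names, not the actual agent.ceo API:

import time

CGROUP = "/sys/fs/cgroup"

def read_cgroup_bytes(name: str) -> float:
    # memory.max reads "max" when the pod has no limit set
    with open(f"{CGROUP}/{name}") as f:
        raw = f.read().strip()
    return float("inf") if raw == "max" else int(raw)

def memory_pressure() -> float:
    # fraction of the pod's memory limit currently in use
    return read_cgroup_bytes("memory.current") / read_cgroup_bytes("memory.max")

def govern(agent, interval: float = 5.0) -> None:
    while True:
        pressure = memory_pressure()
        if pressure >= 0.95:
            agent.checkpoint_and_exit()   # Tier 3: archive state, terminate gracefully
        elif pressure >= 0.85:
            agent.clear_caches()          # Tier 2: drop cached tool outputs
        elif pressure >= 0.70:
            agent.compact_context()       # Tier 1: summarize conversation history
        time.sleep(interval)              # re-check before the kernel intervenes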

Observability for Agent Workloads

Standard Kubernetes monitoring (CPU, memory, pod status) is necessary but not sufficient for AI agents. You also need agent-level observability: what is the agent doing, is it making progress, and how much is it costing?

agent.ceo exports three categories of metrics to your existing observability stack.

Infrastructure metrics (Prometheus-compatible): Pod CPU/memory usage, restart counts, PVC utilization, NATS connection state. These integrate with your existing Grafana dashboards and PagerDuty alerts.

Agent activity metrics: Tool calls per minute, tokens consumed per session, task completion rates, SLA compliance percentages. These feed the agent observability dashboard and expose agent productivity in quantitative terms.

Cost metrics: Per-agent token spend, cost per task, daily/weekly/monthly burn rate, budget utilization percentage. These feed the cost optimization pipeline and trigger alerts when spending patterns change.

graph LR
    subgraph "Agent Pods"
        A1["CTO Agent"]
        A2["DevOps Agent"]
        A3["Backend Agent"]
    end

    subgraph "Observability Stack"
        PROM["Prometheus<br/>Infra metrics"]
        NATS["NATS Audit Stream<br/>Agent activity"]
        COST["Cost Tracker<br/>Token economics"]
    end

    subgraph "Your Dashboards"
        GRAF["Grafana"]
        PD["PagerDuty"]
        CUSTOM["Custom Alerts"]
    end

    A1 & A2 & A3 --> PROM & NATS & COST
    PROM & NATS & COST --> GRAF & PD & CUSTOM
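
If you wire these metrics into Prometheus, alert rules along these lines follow naturally. The metric names below are hypothetical placeholders; substitute whatever your exporter actually emits:

groups:
  - name: agent-fleet
    rules:
      - alert: AgentStuck
        # active task but zero tool calls for 10+ minutes (metric names hypothetical)
        expr: rate(agent_tool_calls_total[10m]) == 0 and on(pod) agent_task_active == 1
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.pod }} is active but has made no tool calls for 10 minutes"
      - alert: AgentBudgetNearLimit
        expr: agent_budget_utilization_ratio > 0.9
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} has used over 90% of its token budget"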

Crash Recovery and Self-Healing

Kubernetes provides pod restart on crash. agent.ceo extends this with agent-aware recovery.

When an agent pod restarts, the recovery sequence runs in order: mount the persistent workspace, load the last session checkpoint from Firestore, reconnect to NATS and process any queued inbox messages, then either resume the interrupted task from its checkpoint or start fresh, depending on the severity of the interruption. The state recovery patterns ensure that a pod restart costs minutes of lost context, not hours of lost work.
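
In rough pseudocode, the agent-side startup path looks like this; every function name here is illustrative, not the real agent.ceo internals:

def recover(agent_id: str) -> None:
    checkpoint = load_checkpoint(agent_id)      # last session state, from Firestore
    inbox = connect_nats(agent_id)              # re-establish messaging
    for msg in inbox.drain_queued():            # messages that arrived while down
        handle_message(msg)
    if checkpoint and checkpoint.recoverable:   # a graceful Tier 3 shutdown leaves one
        resume_task(checkpoint)                 # costs minutes of lost context
    else:
        start_fresh()                           # severe interruption: restart the task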

For infrastructure-level self-healing, the platform monitors agent health beyond simple liveness probes. An agent that is running but unproductive (zero tool calls for 10+ minutes during an active task) triggers investigation. An agent that is cycling through restart loops gets its task reassigned to a healthy agent in the same role pool.

Deployment Models

agent.ceo supports two deployment models for platform teams.

SaaS (managed): We run the Kubernetes infrastructure. Your agents deploy to our GKE clusters with namespace isolation, managed NATS, and Firestore state storage. Best for teams that want to start fast without infrastructure overhead.

Private installation: agent.ceo deploys to your Kubernetes cluster — GKE, EKS, AKS, or bare metal. You control the infrastructure, network policies, storage, and secrets management. Your agents never leave your network. Best for enterprises with data residency requirements, air-gapped environments, or existing Kubernetes operations teams.

Both models use the same agent runtime, the same MCP tool ecosystem, and the same management APIs. Migration between them is a configuration change, not a rewrite.

Try It

If you run a platform engineering or DevOps team and your organization is deploying AI agents, agent.ceo gives you the Kubernetes-native control plane to manage them with the same rigor you apply to every other production workload.

SaaS — 100 free agent-hours at agent.ceo.

Enterprise — Private installation on your cluster. Contact enterprise@agent.ceo.

Design partners — We are working with platform engineering teams building agent infrastructure. Reach out at hello@agent.ceo.


agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo
