Technical | 8 min read

Cost Optimization for AI Agent Workloads

Moshe Beeri, Founder

Tags: cost-optimization, scale-to-zero, spot-instances, resource-management, kubernetes, ai-agents, finops

The Cost Problem

AI agents are expensive to run. Each agent consumes dedicated CPU, memory, and often GPU resources for extended periods. Unlike stateless API endpoints that handle requests in milliseconds, agents hold resources for minutes to hours while thinking, coding, and iterating. Without deliberate cost optimization, a fleet of 50 agents on standard GKE nodes would cost $15,000-$25,000/month in compute alone. At agent.ceo, we reduced this by 70% through a combination of scale-to-zero, spot instances, intelligent scheduling, and resource right-sizing. Here is exactly how.

AI agent workloads have a distinctive resource utilization pattern. During active work, agents burst to high CPU and memory usage. Between tasks, they sit idle while still holding their baseline reservation. For most organizations, agents are actively working only 20-40% of the time; the remaining 60-80% is wasted spend.

Traditional Kubernetes deployments keep pods running regardless of utilization. Reserving 2 CPU cores and 8 GB RAM per agent 24/7 when each agent is productive only 8 hours a day means paying for three times the capacity you actually use.
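
To make that arithmetic concrete, here is a minimal sketch. The function names are illustrative, and it assumes the bill is proportional to reserved hours, which ignores discounts and shared node overhead.

```typescript
// Ratio of hours paid for (24/7 reservation) to hours of productive work.
// Assumption: cost scales linearly with reserved hours.
function overpaymentFactor(productiveHoursPerDay: number): number {
  if (productiveHoursPerDay <= 0 || productiveHoursPerDay > 24) {
    throw new RangeError('productiveHoursPerDay must be in (0, 24]');
  }
  return 24 / productiveHoursPerDay;
}

// Share of an always-on bill that pays for idle time.
function idleSpendShare(productiveHoursPerDay: number): number {
  return 1 - productiveHoursPerDay / 24;
}

console.log(overpaymentFactor(8));         // 3
console.log(idleSpendShare(8).toFixed(2)); // 0.67
```

An agent that works 8 hours a day leaves two thirds of an always-on reservation unused, which is the spend the strategies below target.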

Strategy 1: Scale-to-Zero

The highest-impact optimization is pausing agents when they have no work. We monitor agent activity and terminate idle pods after a configurable threshold, preserving workspace state for rapid resume:

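import { initializeApp } from 'firebase-admin/app';
import { KubeConfig, CoreV1Api } from '@kubernetes/client-node';
import { getFirestore, FieldValue } from 'firebase-admin/firestore';

// Initialize the default Firebase app before calling getFirestore()
initializeApp();
const db = getFirestore();

// Load credentials from the pod's service account (in-cluster config)
const k8s = new KubeConfig();
k8s.loadFromCluster();
const coreApi = k8s.makeApiClient(CoreV1Api);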

interface ScaleToZeroConfig {
  idleThresholdMinutes: number;    // Default: 15
  preserveWorkspace: boolean;       // Default: true
  warmStartEnabled: boolean;        // Default: true
}

class ScaleToZeroController {
  private config: ScaleToZeroConfig;

  constructor(config: ScaleToZeroConfig) {
    this.config = config;
  }

  /**
   * Runs every minute. Identifies idle agents and scales them to zero.
   */
  async checkIdleAgents(): Promise<void> {
    // collectionGroup spans the `agents` subcollection of every organization
    const runningAgents = await db.collectionGroup('agents')
      .where('status', '==', 'running')
      .get();

    const now = Date.now();
    const threshold = this.config.idleThresholdMinutes * 60 * 1000;

    for (const doc of runningAgents.docs) {
      const agent = doc.data();
      // Agents that never reported activity count as idle since the epoch
      const lastActive = agent.lastActiveAt?.toMillis() ?? 0;
      const idleDuration = now - lastActive;

      if (idleDuration > threshold) {
        // Document path is organizations/{orgId}/agents/{agentId}
        const orgId = doc.ref.parent.parent?.id;
        if (!orgId) continue;
        const agentId = doc.id;

        console.log(`Scaling to zero: ${agentId} (idle ${Math.floor(idleDuration / 60000)}min)`);
        await this.scaleDown(orgId, agentId);
      }
    }
  }

  private async scaleDown(orgId: string, agentId: string): Promise<void> {
    const namespace = `org-${orgId}`;

    // Checkpoint agent state
    if (this.config.preserveWorkspace) {
      await this.createCheckpoint(orgId, agentId);
    }

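    // Delete pod (PVC remains). Positional arguments follow the pre-1.0
    // @kubernetes/client-node API; 30 is gracePeriodSeconds.
    try {
      await coreApi.deleteNamespacedPod(agentId, namespace, undefined, undefined, 30);
    } catch (e: unknown) {
      // A 404 means the pod is already gone, which is the desired state
      if ((e as { statusCode?: number }).statusCode !== 404) throw e;
    }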

    // Update Firestore
    await db.doc(`organizations/${orgId}/agents/${agentId}`).update({
      status: 'paused',
      pausedAt: FieldValue.serverTimestamp(),
      pauseReason: 'idle_scale_to_zero'
    });
  }

  /**
   * Resume agent when new task arrives.
   * Target: <30 seconds to ready state.
   */
  async warmStart(orgId: string, agentId: string): Promise<void> {
    const agentDoc = await db.doc(`organizations/${orgId}/agents/${agentId}`).get();
    const agent = agentDoc.data();

    // Nothing to do if the agent document is missing or the agent is not paused
    if (!agent || agent.status !== 'paused') return;

    const namespace = `org-${orgId}`;

    // Recreate pod with existing PVC for instant workspace access.
    // buildAgentPod is an application helper (not shown) that returns the pod spec.
    const pod = buildAgentPod(agentId, orgId, agent, {
      existingPvc: `workspace-${agentId}`,
      checkpoint: agent.lastCheckpoint
    });

    await coreApi.createNamespacedPod(namespace, pod);

    await agentDoc.ref.update({
      status: 'running',
      resumedAt: FieldValue.serverTimestamp()
    });
  }

  private async createCheckpoint(orgId: string, agentId: string): Promise<void> {
    // Store agent's current task context and conversation state.
    // getAgentContext and getConversationLength are methods not shown here.
    await db.doc(`organizations/${orgId}/agents/${agentId}`).update({
      lastCheckpoint: {
        timestamp: FieldValue.serverTimestamp(),
        taskContext: await this.getAgentContext(orgId, agentId),
        conversationLength: await this.getConversationLength(orgId, agentId)
      }
    });
  }
}

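// Run the idle check every minute with an in-process timer
const controller = new ScaleToZeroController({
  idleThresholdMinutes: 15,
  preserveWorkspace: true,
  warmStartEnabled: true
});

setInterval(() => {
  // Catch failures so one bad pass does not become an unhandled rejection
  controller.checkIdleAgents().catch((err) => console.error('Idle check failed:', err));
}, 60_000);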

Scale-to-zero typically saves 60-80% on compute costs for organizations with standard business-hours usage patterns. The tradeoff is 15-30 seconds of cold start time when an agent resumes.
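
As a rough way to estimate this for your own fleet, the sketch below models a day as a number of work sessions, each billed for its active time plus the idle threshold that elapses before the pod is terminated. The model and the names in it are assumptions for illustration, not our billing logic, and it ignores cold start time and node-level overhead.

```typescript
interface DailyUsage {
  activeHours: number; // hours per day the agent is doing work
  sessions: number;    // separate bursts of work per day
}

// Fraction of an always-on bill avoided by scale-to-zero.
// Each session keeps the pod alive for an extra idle-threshold "tail".
function scaleToZeroSavings(usage: DailyUsage, idleThresholdMinutes: number): number {
  const tailHours = (usage.sessions * idleThresholdMinutes) / 60;
  const billedHours = Math.min(24, usage.activeHours + tailHours);
  return 1 - billedHours / 24;
}

// 8 active hours in 4 sessions with the default 15-minute threshold:
// billed = 8 + 4 * 0.25 = 9 hours, so savings = 1 - 9/24
console.log(scaleToZeroSavings({ activeHours: 8, sessions: 4 }, 15)); // 0.625
```

The tail term shows why the threshold matters: many short sessions with a long threshold erode the savings, while a very short threshold trades them for more frequent cold starts.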

Strategy 2: Spot Instances and Preemptible Nodes

GKE Spot VMs cost 60-91% less than standard instances. Agent workloads tolerate preemption well because we checkpoint state before termination:

apiVersion: v1
kind: Pod
metadata:
  name: agent-worker
spec:
  # Require spot nodes (nodeSelector is a hard constraint, not a preference)
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
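  # Graceful handling of preemption. Spot VMs get roughly 30 seconds' notice,
  # so the preStop work below should finish well inside that window.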
  terminationGracePeriodSeconds: 60
  containers:
    - name: claude-agent
      image: gcr.io/agent-ceo/claude-agent:stable
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
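                # Flush metrics and billing (time-bounded so a hung call cannot stall shutdown)
                curl -fsS --max-time 10 -X POST http://localhost:9090/flush
                # Save conversation state
                curl -fsS --max-time 15 -X POST http://localhost:8080/checkpoint
                # No explicit kill needed: the kubelet sends SIGTERM once preStop returns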
      resources:
        requests:
          cpu: "500m"
          memory: "2Gi"
        limits:
          cpu: "2000m"
          memory: "8Gi"
---
# Priority class for critical agents. Priority only affects scheduler preemption;
# keeping these agents off spot nodes is done with their nodeSelector.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: agent-critical
value: 1000000
globalDefault: false
description: "Critical agents that must not be preempted"
---
# Priority class for standard agents on spot nodes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: agent-standard
value: 100000
globalDefault: true
description: "Standard agents, tolerant of preemption"

We maintain a small pool of on-demand nodes for agents marked as critical (e.g., production deployment agents), while routing all other workloads to spot:

# Node pool configuration (Terraform)
resource "google_container_node_pool" "spot_agents" {
  name       = "spot-agents"
  cluster    = google_container_cluster.agent_platform.name
  location   = var.region

  autoscaling {
    min_node_count = 0
    max_node_count = 50
  }

  node_config {
    spot         = true
    machine_type = "e2-standard-4"  # 4 vCPU, 16 GB RAM

    labels = {
      "cloud.google.com/gke-spot" = "true"
      "workload-type"             = "agent"
    }

    taint {
      key    = "cloud.google.com/gke-spot"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }
}

resource "google_container_node_pool" "ondemand_critical" {
  name       = "ondemand-critical"
  cluster    = google_container_cluster.agent_platform.name
  location   = var.region

  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

  node_config {
    spot         = false
    machine_type = "e2-standard-4"

    labels = {
      "workload-type" = "agent-critical"
    }
  }
}
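
The routing between the two pools can be sketched as a small function that picks scheduling fields per agent. The labels, taint, and priority class names come from the manifests above; the `critical` flag and the local types are hypothetical and exist only to keep the example self-contained.

```typescript
// Minimal local types; a real controller would use its Kubernetes client's pod types.
interface Toleration {
  key: string;
  operator: 'Equal' | 'Exists';
  value?: string;
  effect: 'NoSchedule' | 'PreferNoSchedule' | 'NoExecute';
}

interface Placement {
  nodeSelector: Record<string, string>;
  tolerations: Toleration[];
  priorityClassName: string;
}

// `critical` is a hypothetical per-agent flag used for illustration.
function placementFor(critical: boolean): Placement {
  if (critical) {
    // No spot toleration, so the taint keeps these pods off spot nodes.
    return {
      nodeSelector: { 'workload-type': 'agent-critical' },
      tolerations: [],
      priorityClassName: 'agent-critical',
    };
  }
  return {
    nodeSelector: { 'cloud.google.com/gke-spot': 'true' },
    tolerations: [
      { key: 'cloud.google.com/gke-spot', operator: 'Equal', value: 'true', effect: 'NoSchedule' },
    ],
    priorityClassName: 'agent-standard',
  };
}

console.log(placementFor(false).nodeSelector);
console.log(placementFor(true).nodeSelector);
```

Spreading the result into the pod spec keeps the placement decision in one place, so changing which agents count as critical does not touch the manifests.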

Strategy 3: Resource Right-Sizing

Most agents do not need their full resource allocation most of the time. We use Vertical Pod Autoscaler (VPA) recommendations to right-size agents based on actual usage:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: agent-vpa
  namespace: org-acme-corp
spec:
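  # VPA targets a workload controller, not bare pods, and does not accept
  # wildcards. The Deployment name here is illustrative.
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: agent-worker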
  updatePolicy:
    updateMode: "Off"  # Recommend only, don't auto-apply
  resourcePolicy:
    containerPolicies:
      - containerName: claude-agent
        minAllowed:
          cpu: "250m"
          memory: "1Gi"
        maxAllowed:
          cpu: "4000m"
          memory: "16Gi"

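We collect VPA recommendations and apply right-sized values during agent restarts. The function below derives requests and limits from 7 days of usage data in Prometheus: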

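interface ResourceSpec {
  requests: { cpu: string; memory: string };
  limits: { cpu: string; memory: string };
}

// queryPrometheus is an application helper (not shown) that returns one numeric value.
async function getOptimalResources(orgId: string, agentId: string): Promise<ResourceSpec> {
  // P95 CPU usage in cores. The metric is a cumulative counter, so take
  // rate() first and compute the quantile over a subquery.
  const cpuP95 = await queryPrometheus(`
    quantile_over_time(0.95,
      rate(container_cpu_usage_seconds_total{
        pod="${agentId}",
        namespace="org-${orgId}",
        container="claude-agent"
      }[5m])[7d:5m]
    )
  `);

  // P95 memory usage in bytes (a gauge, so no rate() is needed)
  const memP95 = await queryPrometheus(`
    quantile_over_time(0.95,
      container_memory_usage_bytes{
        pod="${agentId}",
        namespace="org-${orgId}",
        container="claude-agent"
      }[7d]
    )
  `);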

  // Add 20% buffer above P95
  return {
    requests: {
      cpu: `${Math.ceil(cpuP95 * 1.2 * 1000)}m`,
      memory: `${Math.ceil(memP95 * 1.2 / (1024 * 1024 * 1024))}Gi`
    },
    limits: {
      cpu: `${Math.ceil(cpuP95 * 3 * 1000)}m`,    // 3x for burst
      memory: `${Math.ceil(memP95 * 2 / (1024 * 1024 * 1024))}Gi`
    }
  };
}
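
To see what the buffer rule produces, here is a worked variant that takes the P95 values directly. It is a sketch rather than our production code: the function name is illustrative, CPU is passed in millicores, and memory is rounded to MiB instead of whole GiB, since rounding 1.1 GiB up to 2 GiB would nearly double a small request.

```typescript
interface Sizing {
  requests: { cpu: string; memory: string };
  limits: { cpu: string; memory: string };
}

const MIB = 1024 * 1024;

// Applies the multipliers from the text: requests at P95 + 20%,
// CPU limit at 3x P95, memory limit at 2x P95.
function sizeFromP95(cpuP95Millicores: number, memP95Bytes: number): Sizing {
  const cpu = (percent: number) => `${Math.ceil((cpuP95Millicores * percent) / 100)}m`;
  const mem = (percent: number) => `${Math.ceil((memP95Bytes * percent) / 100 / MIB)}Mi`;
  return {
    requests: { cpu: cpu(120), memory: mem(120) },
    limits: { cpu: cpu(300), memory: mem(200) },
  };
}

// An agent whose P95 usage is 0.4 cores and 1.5 GiB of memory:
const sizing = sizeFromP95(400, 1.5 * 1024 * MIB);
console.log(sizing.requests); // { cpu: '480m', memory: '1844Mi' }
console.log(sizing.limits);   // { cpu: '1200m', memory: '3072Mi' }
```

Compared with the default 500m and 2Gi requests shown earlier, this agent would request slightly less of both, which lets more agents share a node.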

Strategy 4: Intelligent Scheduling

Not all tasks need to run immediately. Batch workloads can be scheduled during off-peak hours when spot instance availability is higher and costs are lower:

// getSpotPricing, getAverageSpotPrice, predictLowCostWindow, and dispatchToAgent
// are application helpers not shown here.
class CostAwareScheduler {
  async scheduleTask(task: Task, orgId: string): Promise<void> {
    // Check if task can be deferred for cost savings
    if (task.priority === 'low' && !task.deadline) {
      const currentSpotPrice = await getSpotPricing();
      const avgPrice = await getAverageSpotPrice(7); // 7-day average

      if (currentSpotPrice > avgPrice * 1.3) {
        // Spot prices are 30% above average — defer
        const optimalTime = await predictLowCostWindow(4); // Next 4 hours
        task.scheduledFor = optimalTime;

        await db.collection(`organizations/${orgId}/scheduledTasks`).add({
          ...task,
          reason: 'cost_optimization',
          estimatedSavings: `${Math.round((1 - avgPrice / currentSpotPrice) * 100)}%`
        });

        return;
      }
    }

    // Immediate execution
    await dispatchToAgent(task, orgId);
  }
}
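
The deferral rule itself is easy to isolate and test. The sketch below restates it as a pure function with the 30% threshold as a default parameter; the function and type names are illustrative, and the prices in the example are made-up inputs rather than real spot rates.

```typescript
interface DeferralDecision {
  defer: boolean;
  estimatedSavingsPercent: number; // 0 when the task runs immediately
}

// Defer when the current price exceeds the average by more than the threshold ratio.
function decideDeferral(
  currentPrice: number,
  averagePrice: number,
  thresholdRatio = 1.3,
): DeferralDecision {
  if (currentPrice <= 0 || averagePrice <= 0) {
    throw new RangeError('prices must be positive');
  }
  const defer = currentPrice > averagePrice * thresholdRatio;
  const estimatedSavingsPercent = defer
    ? Math.round((1 - averagePrice / currentPrice) * 100)
    : 0;
  return { defer, estimatedSavingsPercent };
}

// Hypothetical hourly prices: 50% above average triggers deferral.
console.log(decideDeferral(0.015, 0.010)); // { defer: true, estimatedSavingsPercent: 33 }
// 20% above average stays under the threshold and runs now.
console.log(decideDeferral(0.012, 0.010)); // { defer: false, estimatedSavingsPercent: 0 }
```

Keeping the decision free of I/O means the threshold can be tuned against historical prices without touching Firestore or the dispatcher.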

Cost Dashboard

We provide customers with real-time cost visibility so they can make informed decisions about their agent fleet:

# Current hourly burn rate per organization
sum by (org_id) (
  (rate(container_cpu_usage_seconds_total{container="claude-agent"}[5m]) * 0.032)  # Cores in use * CPU cost/core-hour
  +
  (container_memory_usage_bytes{container="claude-agent"} / 1073741824 * 0.004)  # GiB in use * memory cost/GiB-hour
)

# Projected monthly cost
sum by (org_id) (
  increase(agent_compute_cost_dollars_total[1h])  # Dollars spent over the last hour
) * 720  # Hours in a 30-day month

# Cost savings from scale-to-zero
sum by (org_id) (
  agent_paused_hours_total * 0.08  # Cost per agent-hour avoided
)

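# Percentage of agent pods on spot nodes. kube_pod_info carries no container
# label, so this groups by the org-<id> namespace instead.
count by (namespace) (
  kube_pod_info{node=~".*spot.*", namespace=~"org-.*"}
) / count by (namespace) (
  kube_pod_info{namespace=~"org-.*"}
) * 100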
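
For a quick sanity check of what the burn-rate and projection queries should return, the same arithmetic can be done by hand. This sketch reuses the per-hour rates from the first query ($0.032 per core and $0.004 per GiB) and the 720-hour month; the function names are illustrative.

```typescript
const CPU_COST_PER_CORE_HOUR = 0.032; // rate used in the dashboard query
const MEM_COST_PER_GIB_HOUR = 0.004;  // rate used in the dashboard query
const HOURS_PER_MONTH = 720;          // 30-day month

// Hourly cost of an agent using the given cores and GiB of memory.
function hourlyCost(cores: number, memoryGiB: number): number {
  return cores * CPU_COST_PER_CORE_HOUR + memoryGiB * MEM_COST_PER_GIB_HOUR;
}

// Monthly projection if that usage were sustained around the clock.
function projectedMonthlyCost(cores: number, memoryGiB: number): number {
  return hourlyCost(cores, memoryGiB) * HOURS_PER_MONTH;
}

// One agent holding 2 cores and 8 GiB:
console.log(hourlyCost(2, 8).toFixed(3));           // 0.096
console.log(projectedMonthlyCost(2, 8).toFixed(2)); // 69.12
```

If the dashboard shows a figure far from this for a known workload, the usual suspect is a missing `rate()` or `increase()` around a counter.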

Results

Combining all four strategies, here is the cost breakdown for a typical 20-agent fleet:

| Strategy | Monthly Savings | Implementation Effort |
| --- | --- | --- |
| Scale-to-zero | $4,800 (60%) | Medium |
| Spot instances | $2,400 (65% of remaining) | Low |
| Right-sizing | $800 (20% of remaining) | Low |
| Intelligent scheduling | $400 (variable) | Medium |
| Total | $8,400/month | |

From a baseline of ~$12,000/month for 20 always-on agents, we reduce to ~$3,600/month — a 70% reduction.
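
The totals can be reproduced from the dollar figures in the table. This is only a check of the arithmetic, using the monthly savings and baseline stated above.

```typescript
interface StrategySavings {
  strategy: string;
  monthlySavings: number; // USD per month, from the table
}

const baselineMonthly = 12_000; // 20 always-on agents

const savings: StrategySavings[] = [
  { strategy: 'Scale-to-zero', monthlySavings: 4_800 },
  { strategy: 'Spot instances', monthlySavings: 2_400 },
  { strategy: 'Right-sizing', monthlySavings: 800 },
  { strategy: 'Intelligent scheduling', monthlySavings: 400 },
];

const totalSavings = savings.reduce((sum, s) => sum + s.monthlySavings, 0);
const optimizedMonthly = baselineMonthly - totalSavings;
const reductionPercent = Math.round((totalSavings / baselineMonthly) * 100);

console.log(totalSavings, optimizedMonthly, reductionPercent); // 8400 3600 70
```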

These optimizations work in concert with our Stripe billing integration to pass savings through to customers. The real-time monitoring stack provides the utilization signals that drive scaling decisions.

For teams managing their own agent infrastructure, the Kubernetes deployment guide covers the foundational setup, while our scaling AI agents post addresses horizontal scaling when your optimized fleet needs to grow.

agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo
