The Cost Problem
AI agents are expensive to run. Each agent consumes dedicated CPU, memory, and often GPU resources for extended periods. Unlike stateless API endpoints that handle requests in milliseconds, agents hold resources for minutes to hours while thinking, coding, and iterating. Without deliberate cost optimization, a fleet of 50 agents on standard GKE nodes would cost $15,000-$25,000/month in compute alone. At agent.ceo, we reduced this by 70% through a combination of scale-to-zero, spot instances, intelligent scheduling, and resource right-sizing. Here is exactly how.
AI agent workloads have a distinctive resource utilization pattern. During active work, agents burst to high CPU and memory usage. Between tasks, they sit idle while still consuming baseline resources. For most organizations, agents are actively working only 20-40% of the time, so the remaining 60-80% of reserved capacity is wasted spend.
Traditional Kubernetes deployments keep pods running regardless of utilization. Reserving 2 CPU cores and 8 GB of RAM per agent 24/7 when the agent is productive only 8 hours/day means paying for 24 hours to get 8 hours of work: roughly 3x what is necessary.
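The over-provisioning arithmetic can be sketched as follows. This is a minimal illustration, not code from the agent.ceo platform; the function names `overprovisionFactor` and `wastedFraction` are hypothetical.

```typescript
// Hypothetical helper: ratio of reserved hours to productive hours.
function overprovisionFactor(activeHoursPerDay: number, reservedHoursPerDay = 24): number {
  if (activeHoursPerDay <= 0 || activeHoursPerDay > reservedHoursPerDay) {
    throw new RangeError('activeHoursPerDay must be in (0, reservedHoursPerDay]');
  }
  return reservedHoursPerDay / activeHoursPerDay;
}

// Hypothetical helper: fraction of reserved capacity that is paid for but unused.
function wastedFraction(activeHoursPerDay: number, reservedHoursPerDay = 24): number {
  return 1 - activeHoursPerDay / reservedHoursPerDay;
}

const factor = overprovisionFactor(8); // 24 reserved hours / 8 productive hours
const wasted = wastedFraction(8);      // share of the bill spent on idle time
console.log(`Paying ${factor}x; ${(wasted * 100).toFixed(0)}% of spend is idle`);
```

With 8 productive hours per day, about two thirds of the reserved capacity sits idle, which falls inside the 60-80% range quoted above.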
Strategy 1: Scale-to-Zero
The highest-impact optimization is pausing agents when they have no work. We monitor agent activity and terminate idle pods after a configurable threshold, preserving workspace state for rapid resume:
import { KubeConfig, CoreV1Api } from '@kubernetes/client-node';
import { getFirestore, FieldValue } from 'firebase-admin/firestore';

// Assumes initializeApp() from 'firebase-admin/app' has already been called at startup
const db = getFirestore();
const k8s = new KubeConfig();
k8s.loadFromCluster();
const coreApi = k8s.makeApiClient(CoreV1Api);

interface ScaleToZeroConfig {
  idleThresholdMinutes: number; // Default: 15
  preserveWorkspace: boolean;   // Default: true
  warmStartEnabled: boolean;    // Default: true
}
class ScaleToZeroController {
  private config: ScaleToZeroConfig;

  constructor(config: ScaleToZeroConfig) {
    this.config = config;
  }

  /**
   * Runs every minute. Identifies idle agents and scales them to zero.
   */
  async checkIdleAgents(): Promise<void> {
    // Collection-group query across every organization's `agents` subcollection
    const agents = await db.collectionGroup('agents')
      .where('status', '==', 'running')
      .get();
    const now = Date.now();
    const threshold = this.config.idleThresholdMinutes * 60 * 1000;

    for (const doc of agents.docs) {
      const agent = doc.data();
      // Agents that never reported activity count as idle since the epoch
      const lastActive = agent.lastActiveAt?.toMillis() ?? 0;
      const idleDuration = now - lastActive;
      if (idleDuration > threshold) {
        // Document path is organizations/{orgId}/agents/{agentId}
        const orgId = doc.ref.parent.parent?.id;
        if (!orgId) continue; // Not nested under an organization
        const agentId = doc.id;
        console.log(`Scaling to zero: ${agentId} (idle ${Math.floor(idleDuration / 60000)}min)`);
        await this.scaleDown(orgId, agentId);
      }
    }
  }

  private async scaleDown(orgId: string, agentId: string): Promise<void> {
    const namespace = `org-${orgId}`;

    // Checkpoint agent state
    if (this.config.preserveWorkspace) {
      await this.createCheckpoint(orgId, agentId);
    }

    // Delete pod with a 30-second grace period (PVC remains).
    // Positional arguments as in @kubernetes/client-node 0.x.
    try {
      await coreApi.deleteNamespacedPod(agentId, namespace, undefined, undefined, 30);
    } catch (e) {
      // A 404 means the pod is already gone, which is fine
      if ((e as { statusCode?: number }).statusCode !== 404) throw e;
    }

    // Update Firestore
    await db.doc(`organizations/${orgId}/agents/${agentId}`).update({
      status: 'paused',
      pausedAt: FieldValue.serverTimestamp(),
      pauseReason: 'idle_scale_to_zero'
    });
  }

  /**
   * Resume agent when new task arrives.
   * Target: <30 seconds to ready state.
   */
  async warmStart(orgId: string, agentId: string): Promise<void> {
    const agentDoc = await db.doc(`organizations/${orgId}/agents/${agentId}`).get();
    const agent = agentDoc.data();
    if (!agent || agent.status !== 'paused') return;
    const namespace = `org-${orgId}`;

    // Recreate pod with existing PVC for instant workspace access.
    // buildAgentPod is a helper that is not shown in this post.
    const pod = buildAgentPod(agentId, orgId, agent, {
      existingPvc: `workspace-${agentId}`,
      checkpoint: agent.lastCheckpoint
    });
    await coreApi.createNamespacedPod(namespace, pod);
    await agentDoc.ref.update({
      status: 'running',
      resumedAt: FieldValue.serverTimestamp()
    });
  }

  private async createCheckpoint(orgId: string, agentId: string): Promise<void> {
    // Store agent's current task context and conversation state.
    // getAgentContext and getConversationLength are class helpers not shown in this post.
    await db.doc(`organizations/${orgId}/agents/${agentId}`).update({
      lastCheckpoint: {
        timestamp: FieldValue.serverTimestamp(),
        taskContext: await this.getAgentContext(orgId, agentId),
        conversationLength: await this.getConversationLength(orgId, agentId)
      }
    });
  }
}
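// Run the idle check every minute
const controller = new ScaleToZeroController({
  idleThresholdMinutes: 15,
  preserveWorkspace: true,
  warmStartEnabled: true
});
setInterval(() => {
  controller.checkIdleAgents().catch((err) => console.error('Idle check failed:', err));
}, 60000);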
Scale-to-zero typically saves 60-80% on compute costs for organizations with standard business-hours usage patterns. The tradeoff is 15-30 seconds of cold start time when an agent resumes.
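To see where a figure in that range can come from, here is a rough estimate under stated assumptions: the agent is billed for its active hours plus one idle-threshold "tail" per work session before it is paused. The `UsagePattern` shape and `scaleToZeroSavings` function are hypothetical, and cold-start time is ignored.

```typescript
// Hypothetical description of one agent's weekly usage.
interface UsagePattern {
  activeHoursPerWeek: number;   // hours the agent is doing real work
  sessionsPerWeek: number;      // distinct work sessions, each followed by a scale-down
  idleThresholdMinutes: number; // idle time still billed before each scale-down
}

// Fraction of always-on compute cost avoided by pausing idle agents.
function scaleToZeroSavings(p: UsagePattern): number {
  const HOURS_PER_WEEK = 168;
  const idleTailHours = p.sessionsPerWeek * (p.idleThresholdMinutes / 60);
  const billedHours = Math.min(HOURS_PER_WEEK, p.activeHoursPerWeek + idleTailHours);
  return 1 - billedHours / HOURS_PER_WEEK;
}

// Business-hours pattern: 40 active hours across 10 sessions, 15-minute threshold.
const savings = scaleToZeroSavings({
  activeHoursPerWeek: 40,
  sessionsPerWeek: 10,
  idleThresholdMinutes: 15,
});
console.log(`Estimated savings: ${(savings * 100).toFixed(1)}%`);
```

Under these assumptions the agent is billed for 42.5 of 168 hours per week. A shorter idle threshold shrinks the billed tail but causes more frequent cold starts.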
Strategy 2: Spot Instances and Preemptible Nodes
GKE Spot VMs cost 60-91% less than standard instances. Agent workloads tolerate preemption well because we checkpoint state before termination:
apiVersion: v1
kind: Pod
metadata:
  name: agent-worker
spec:
  # Require spot nodes for cost savings (nodeSelector is a hard constraint, not a preference)
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
  # Graceful handling of preemption. Spot preemption gives only a short notice
  # (about 30 seconds), so the preStop hook below should finish well within that window.
  terminationGracePeriodSeconds: 60
  containers:
    - name: claude-agent
      image: gcr.io/agent-ceo/claude-agent:stable
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                # Flush metrics and billing
                curl -X POST http://localhost:9090/flush
                # Save conversation state
                curl -X POST http://localhost:8080/checkpoint
                # Signal graceful shutdown
                kill -SIGTERM 1
      resources:
        requests:
          cpu: "500m"
          memory: "2Gi"
        limits:
          cpu: "2000m"
          memory: "8Gi"
---
# Priority class for critical agents that should NOT use spot
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: agent-critical
value: 1000000
globalDefault: false
description: "Critical agents that must not be preempted"
---
# Priority class for standard agents on spot nodes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: agent-standard
value: 100000
globalDefault: true
description: "Standard agents, tolerant of preemption"
We maintain a small pool of on-demand nodes for agents marked as critical (e.g., production deployment agents), while routing all other workloads to spot:
# Node pool configuration (Terraform)
resource "google_container_node_pool" "spot_agents" {
  name     = "spot-agents"
  cluster  = google_container_cluster.agent_platform.name
  location = var.region

  autoscaling {
    min_node_count = 0
    max_node_count = 50
  }

  node_config {
    spot         = true
    machine_type = "e2-standard-4" # 4 vCPU, 16 GB RAM

    labels = {
      "cloud.google.com/gke-spot" = "true"
      "workload-type"             = "agent"
    }

    taint {
      key    = "cloud.google.com/gke-spot"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }
}

resource "google_container_node_pool" "ondemand_critical" {
  name     = "ondemand-critical"
  cluster  = google_container_cluster.agent_platform.name
  location = var.region

  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

  node_config {
    spot         = false
    machine_type = "e2-standard-4"

    labels = {
      "workload-type" = "agent-critical"
    }
  }
}
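One way to route pods between the two pools is to derive the scheduling fields from a per-agent criticality flag. The sketch below is an assumption about how this could be wired, not the platform's actual code: `Placement` and `placementFor` are hypothetical names, and attaching `priorityClassName` to the pod is a choice made here for illustration. The label keys, taint, and priority class names are taken from the manifests above.

```typescript
// Hypothetical subset of a pod spec that controls where the pod lands.
interface Placement {
  nodeSelector: Record<string, string>;
  tolerations: { key: string; operator: 'Equal'; value: string; effect: 'NoSchedule' }[];
  priorityClassName: string;
}

// Critical agents go to the on-demand pool; everything else goes to spot.
function placementFor(critical: boolean): Placement {
  if (critical) {
    return {
      nodeSelector: { 'workload-type': 'agent-critical' },
      tolerations: [], // no spot toleration, so tainted spot nodes are excluded
      priorityClassName: 'agent-critical',
    };
  }
  return {
    nodeSelector: { 'cloud.google.com/gke-spot': 'true' },
    tolerations: [
      { key: 'cloud.google.com/gke-spot', operator: 'Equal', value: 'true', effect: 'NoSchedule' },
    ],
    priorityClassName: 'agent-standard',
  };
}

const deployAgentPlacement = placementFor(true);
const codingAgentPlacement = placementFor(false);
```

Because the spot pool is tainted and critical pods carry no matching toleration, a critical agent cannot be scheduled onto a spot node even if the on-demand pool is temporarily full.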
Strategy 3: Resource Right-Sizing
Most agents do not need their full resource allocation most of the time. We use Vertical Pod Autoscaler (VPA) recommendations to right-size agents based on actual usage:
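apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: agent-vpa
  namespace: org-acme-corp
spec:
  # VPA targets a workload controller rather than individual Pods, and
  # targetRef names do not support wildcards. The Deployment name below is
  # illustrative; point it at whichever controller owns the agent pods.
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: agent-worker
  updatePolicy:
    updateMode: "Off" # Recommend only, don't auto-apply
  resourcePolicy:
    containerPolicies:
      - containerName: claude-agent
        minAllowed:
          cpu: "250m"
          memory: "1Gi"
        maxAllowed:
          cpu: "4000m"
          memory: "16Gi"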
Alongside the VPA recommendations, we compute requests and limits from observed P95 usage in Prometheus and apply them during agent restarts:
async function getOptimalResources(orgId: string, agentId: string): Promise<ResourceSpec> {
  // Fetch actual usage from Prometheus (queryPrometheus is assumed to return a single number).
  // container_cpu_usage_seconds_total is a counter, so convert it to cores with rate()
  // and take the P95 of that rate over 7 days using a subquery.
  const cpuP95 = await queryPrometheus(`
    quantile_over_time(0.95,
      rate(container_cpu_usage_seconds_total{
        pod="${agentId}",
        namespace="org-${orgId}"
      }[5m])[7d:5m]
    )
  `);
  const memP95 = await queryPrometheus(`
    quantile_over_time(0.95,
      container_memory_usage_bytes{
        pod="${agentId}",
        namespace="org-${orgId}"
      }[7d]
    )
  `);

  // Add 20% buffer above P95
  return {
    requests: {
      cpu: `${Math.ceil(cpuP95 * 1.2 * 1000)}m`,
      memory: `${Math.ceil(memP95 * 1.2 / (1024 * 1024 * 1024))}Gi`
    },
    limits: {
      cpu: `${Math.ceil(cpuP95 * 3 * 1000)}m`, // 3x for burst
      memory: `${Math.ceil(memP95 * 2 / (1024 * 1024 * 1024))}Gi`
    }
  };
}
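A worked example of the same buffer rule may help. The sketch below applies the 20% request buffer and the 3x CPU and 2x memory limits to concrete numbers, using integer millicores and MiB so that memory is not rounded up to a whole GiB. The `ResourceSpec` shape and the `rightSize` function are assumptions made for illustration.

```typescript
// Assumed shape of the resource spec used when rebuilding an agent pod.
interface ResourceSpec {
  requests: { cpu: string; memory: string };
  limits: { cpu: string; memory: string };
}

// Hypothetical helper: derive requests and limits from observed P95 usage.
function rightSize(cpuP95Millicores: number, memP95MiB: number): ResourceSpec {
  // Scale by a whole-number percentage and round up to the next unit.
  const scale = (value: number, percent: number): number => Math.ceil((value * percent) / 100);
  return {
    requests: {
      cpu: `${scale(cpuP95Millicores, 120)}m`, // P95 + 20% buffer
      memory: `${scale(memP95MiB, 120)}Mi`,
    },
    limits: {
      cpu: `${scale(cpuP95Millicores, 300)}m`, // 3x P95 for burst
      memory: `${scale(memP95MiB, 200)}Mi`,    // 2x P95
    },
  };
}

// An agent whose P95 usage is 0.4 cores and 1.5 GiB.
const spec = rightSize(400, 1536);
console.log(JSON.stringify(spec));
```

For this agent the requests come out at 480m CPU and 1844Mi memory, well under the 2 cores and 8 GB of a fixed allocation.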
Strategy 4: Intelligent Scheduling
Not all tasks need to run immediately. Batch workloads can be scheduled during off-peak hours when spot instance availability is higher and costs are lower:
class CostAwareScheduler {
  async scheduleTask(task: Task, orgId: string): Promise<void> {
    const org = await db.doc(`organizations/${orgId}`).get();
    const orgData = org.data();

    // Check if task can be deferred for cost savings
    if (task.priority === 'low' && !task.deadline) {
      const currentSpotPrice = await getSpotPricing();
      const avgPrice = await getAverageSpotPrice(7); // 7-day average
      if (currentSpotPrice > avgPrice * 1.3) {
        // Spot prices are more than 30% above average: defer
        const optimalTime = await predictLowCostWindow(4); // Next 4 hours
        task.scheduledFor = optimalTime;
        await db.collection(`organizations/${orgId}/scheduledTasks`).add({
          ...task,
          reason: 'cost_optimization',
          estimatedSavings: `${Math.round((1 - avgPrice / currentSpotPrice) * 100)}%`
        });
        return;
      }
    }

    // Immediate execution
    await dispatchToAgent(task, orgId);
  }
}
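The deferral rule itself can be isolated as a pure function, which makes it easy to unit test without Firestore or a pricing service. This is a sketch under assumptions: the `DeferrableTask` and `DeferDecision` types and the `decideDeferral` name are hypothetical, and the prices in the example are made-up inputs.

```typescript
// Hypothetical minimal task shape needed for the decision.
interface DeferrableTask {
  priority: 'low' | 'normal' | 'high';
  deadline?: Date;
}

interface DeferDecision {
  defer: boolean;
  estimatedSavingsPct: number; // expected saving if the task runs at the average price
}

// Defer only low-priority tasks without a deadline, and only when the current
// price exceeds the average by more than the threshold (1.3 = 30% above average).
function decideDeferral(
  task: DeferrableTask,
  currentPrice: number,
  avgPrice: number,
  threshold = 1.3,
): DeferDecision {
  const eligible = task.priority === 'low' && task.deadline === undefined;
  if (!eligible || currentPrice <= avgPrice * threshold) {
    return { defer: false, estimatedSavingsPct: 0 };
  }
  return {
    defer: true,
    estimatedSavingsPct: Math.round((1 - avgPrice / currentPrice) * 100),
  };
}

// Illustrative prices only: current price is 50% above the average.
const decision = decideDeferral({ priority: 'low' }, 0.015, 0.01);
console.log(decision);
```

Keeping the rule separate from the I/O also makes the threshold easy to tune per organization.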
Cost Dashboard
We provide customers with real-time cost visibility so they can make informed decisions about their agent fleet:
# Current hourly burn rate per organization
# (the CPU metric is a counter, so rate() converts it to cores in use)
sum by (org_id) (
  (rate(container_cpu_usage_seconds_total{container="claude-agent"}[5m]) * 0.032) # CPU cost/hour/core
  +
  (container_memory_usage_bytes{container="claude-agent"} / 1073741824 * 0.004) # Memory cost/hour/GB
)
# Projected monthly cost
# (rate() yields dollars per second, so scale by seconds per hour and hours per month)
sum by (org_id) (
  rate(agent_compute_cost_dollars_total[1h])
) * 3600 * 720
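# Cost savings from scale-to-zero
sum by (org_id) (
  agent_paused_hours_total * 0.08 # Cost per agent-hour avoided
)
# Spot vs on-demand ratio (percentage of agent pods on spot nodes)
# kube_pod_info has no container label, so agent pods are selected through
# kube_pod_container_info and joined on (namespace, pod) to pick up the node
count by (org_id) (
  kube_pod_container_info{container="claude-agent"}
    * on (namespace, pod) group_left (node)
  kube_pod_info{node=~".*spot.*"}
) / count by (org_id) (
  kube_pod_container_info{container="claude-agent"}
) * 100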
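When projecting from a Prometheus `rate()`, the unit conversion is easy to get wrong because `rate()` reports a per-second value. The sketch below shows the conversion for a 720-hour month; the function names are hypothetical, and the $0.08 per agent-hour input is the figure used in the scale-to-zero query above.

```typescript
const SECONDS_PER_HOUR = 3600;
const HOURS_PER_MONTH = 720; // 30-day month, matching the dashboard queries

// Convert a per-second spend rate (what rate() returns) into a monthly projection.
function projectMonthlyCost(dollarsPerSecond: number): number {
  return dollarsPerSecond * SECONDS_PER_HOUR * HOURS_PER_MONTH;
}

// Convert an hourly price into the per-second rate a cost counter would grow at.
function perSecondRate(dollarsPerHour: number): number {
  return dollarsPerHour / SECONDS_PER_HOUR;
}

// One always-on agent at $0.08 per agent-hour.
const monthly = projectMonthlyCost(perSecondRate(0.08));
console.log(`Projected monthly cost: $${monthly.toFixed(2)}`);
```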
Results
Combining all four strategies, here is the cost breakdown for a typical 20-agent fleet:
| Strategy | Monthly Savings | Implementation Effort |
|---|---|---|
| Scale-to-zero | $4,800 (40% of baseline) | Medium |
| Spot instances | $2,400 (33% of remaining) | Low |
| Right-sizing | $800 (17% of remaining) | Low |
| Intelligent scheduling | $400 (10% of remaining; variable) | Medium |
| Total | $8,400/month | — |
From a baseline of ~$12,000/month for 20 always-on agents, we reduce to ~$3,600/month — a 70% reduction.
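The totals can be checked with a few lines of arithmetic. This sketch only re-adds the dollar figures from the table; the variable names are illustrative.

```typescript
const baselineMonthly = 12000; // 20 always-on agents

// Monthly savings per strategy, in dollars, as listed in the table.
const monthlySavings: [string, number][] = [
  ['Scale-to-zero', 4800],
  ['Spot instances', 2400],
  ['Right-sizing', 800],
  ['Intelligent scheduling', 400],
];

const totalSavings = monthlySavings.reduce((sum, [, amount]) => sum + amount, 0);
const optimizedMonthly = baselineMonthly - totalSavings;
const reductionPct = Math.round((totalSavings / baselineMonthly) * 100);

console.log(`Saved $${totalSavings}, now $${optimizedMonthly}/month (${reductionPct}% reduction)`);
```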
These optimizations work in concert with our Stripe billing integration to pass savings through to customers. The real-time monitoring stack provides the utilization signals that drive scaling decisions.
For teams managing their own agent infrastructure, the Kubernetes deployment guide covers the foundational setup, while our scaling AI agents post addresses horizontal scaling when your optimized fleet needs to grow.
agent.ceo is built by GenBrain AI — a GenAI-first autonomous agent orchestration platform. General inquiries: hello@agent.ceo | Security: security@agent.ceo