How We Cut Agent Compute Costs with a Shared Pool (And How You Can Too)

TL;DR

A shared pool of auto-scaling pods handles one-shot specialist tasks for every role agent — replacing per-agent subprocess sprawl.

Route cheap work to cheap models (gemini-2.0-flash-lite) and save frontier-model budgets for reasoning-heavy tasks.

This is how a cyborgenic organization manages compute: pooled resources, stateless runs, predictable costs.

Six AI agents. Fifty specialist tasks a day. Every one of them spinning up its own pod, its own context, its own bill — then sitting idle 90% of the time. That is not an architecture. That is a cloud provider's dream and your CFO's nightmare.

Inside a cyborgenic organization, agents do not operate in isolation — they share infrastructure the same way human teams share conference rooms. You do not give every employee their own building. So we built the super-agent shared pool: a managed set of runners that any role agent can dispatch work to, on demand, at a fraction of the cost.

What We Built

The shared pool is exactly what it sounds like: a small, auto-scaling set of pods that any role agent can dispatch one-shot work to. Think of it as a bullpen of generalist runners waiting for assignments.

Here's the architecture at a glance:

1 pod minimum, Horizontal Pod Autoscaler scales up to 3 pods
Each pod handles up to 3 concurrent runs, so the fleet tops out at 9 concurrent runs when all 3 pods are up
A RunRegistry tracks slot availability across all pods
An MCP tool proxy exposes a single super_agent_run function that any agent can call
Every run dies when it's done — no state carries forward between runs

The key insight: most specialist tasks are short-lived and stateless. They don't need a persistent pod. They need 30 seconds of compute, a result, and a clean exit.
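A quick back-of-the-envelope check shows why a small pool is enough. The sketch below only reuses figures quoted in this post (about fifty tasks a day, roughly 30 seconds each, 3 slots per pod, 1 to 3 pods); they are illustrative inputs, not measurements, and the function names are hypothetical.

```python
# Back-of-the-envelope load estimate using the figures quoted in this post.
# All inputs are illustrative assumptions, not measurements from a real fleet.
TASKS_PER_DAY = 50        # specialist tasks across all role agents
SECONDS_PER_TASK = 30     # a typical short-lived run
SLOTS_PER_POD = 3
MIN_PODS, MAX_PODS = 1, 3

SECONDS_PER_DAY = 24 * 60 * 60


def busy_seconds_per_day(tasks: int = TASKS_PER_DAY,
                         seconds: int = SECONDS_PER_TASK) -> int:
    """Total slot-seconds of work the pool has to absorb in a day."""
    return tasks * seconds


def max_concurrent_runs(pods: int, slots_per_pod: int = SLOTS_PER_POD) -> int:
    """Fleet-wide concurrency for a given pod count."""
    return pods * slots_per_pod


def single_slot_utilization() -> float:
    """Fraction of one slot's day used if every task ran back to back."""
    return busy_seconds_per_day() / SECONDS_PER_DAY


if __name__ == "__main__":
    print(busy_seconds_per_day() / 60, "minutes of work per day")
    print(max_concurrent_runs(MIN_PODS), "to", max_concurrent_runs(MAX_PODS), "slots")
    print(f"{single_slot_utilization():.1%} of a single slot")
```

Under these assumptions the whole day's work is about 25 minutes of compute, which is why the extra pods matter for bursts rather than for average load.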

The Components

If you're following along in our repo, here's where things live:

Component	Path
Pool runtime + RunRegistry	`packages/super-agent/`
K8s manifests (Deployment, HPA, NetworkPolicy)	`deploy/gke/manifests/super-agent-pool.yaml`
MCP tool proxy	`conductor/src/mcp_servers/super_agent_mcp.py`
Skill definition	`deploy/gke/configs/skills/super-agent-skill.md`

The RunRegistry is the brains of slot management. It knows how many slots are free, which pods are running what, and whether to accept or reject a new request. The MCP proxy is the front door — it's the only thing your agents actually talk to.
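To make the slot bookkeeping concrete, here is a minimal in-memory sketch of what a registry like this might look like. It is not the code in `packages/super-agent/`: the class layout, method names, and least-loaded placement rule are all assumptions made for illustration.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class RunRegistry:
    """Hypothetical in-memory slot tracker; the real RunRegistry may differ."""
    slots_per_pod: int = 3
    _runs: Dict[str, Set[str]] = field(default_factory=dict)  # pod -> run ids
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def register_pod(self, pod: str) -> None:
        with self._lock:
            self._runs.setdefault(pod, set())

    def free_slots(self) -> int:
        with self._lock:
            return sum(self.slots_per_pod - len(r) for r in self._runs.values())

    def acquire(self, run_id: str) -> Optional[str]:
        """Return the pod that accepted the run, or None if every slot is taken.

        A caller would translate None into a POOL_BUSY response.
        """
        with self._lock:
            # Try the least-loaded pod first so work spreads evenly.
            for pod, runs in sorted(self._runs.items(), key=lambda kv: len(kv[1])):
                if len(runs) < self.slots_per_pod:
                    runs.add(run_id)
                    return pod
            return None

    def release(self, pod: str, run_id: str) -> None:
        """Free the slot when a run exits; unknown ids are ignored."""
        with self._lock:
            self._runs.get(pod, set()).discard(run_id)
```

A single lock keeps the sketch simple. A registry that spans several pods would need shared storage instead of process memory, which is outside what this example shows.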

How to Use It

Every role agent interacts with the pool through a single MCP tool call: super_agent_run. Here's a typical call:

super_agent_run(
    task="summarize-doc",
    prompt_text="Summarize the following document in 3 bullet points: ...",
    adapter="claude",
    cwd_mode="isolated",
    caller_cwd="/home/appuser/workspace",
    model_hint="claude-sonnet-4-20250514"
)

Let's break down each parameter:

task — A short label for the run. Used for logging and slot tracking.
prompt_text — The actual prompt. Alternatively, use prompt_file to point to a file containing the prompt.
adapter — Which model provider to use ("claude", "gemini", etc.).
cwd_mode — "isolated" gives the run its own workspace. "caller" shares the caller's directory (use carefully).
caller_cwd — The calling agent's working directory. The pool's path sanitizer validates this to prevent directory escapes.
model_hint — Suggest a specific model. The pool will use it if available.
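If you want to catch malformed calls before they reach the pool, a small client-side wrapper can help. The sketch below is hypothetical: `RunRequest` is not part of the pool, and the rule that exactly one of prompt_text and prompt_file must be set is our assumption, not documented behavior.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# The two cwd_mode values described above.
VALID_CWD_MODES = {"isolated", "caller"}


@dataclass(frozen=True)
class RunRequest:
    """Hypothetical client-side container for super_agent_run arguments."""
    task: str
    adapter: str
    cwd_mode: str
    caller_cwd: str
    prompt_text: Optional[str] = None
    prompt_file: Optional[str] = None
    model_hint: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: exactly one prompt source should be supplied.
        if (self.prompt_text is None) == (self.prompt_file is None):
            raise ValueError("provide exactly one of prompt_text or prompt_file")
        if self.cwd_mode not in VALID_CWD_MODES:
            raise ValueError(f"cwd_mode must be one of {sorted(VALID_CWD_MODES)}")
        if not self.task.strip():
            raise ValueError("task label must not be empty")

    def to_kwargs(self) -> dict:
        """Drop unset optional fields so only real arguments are passed on."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

The result of `to_kwargs()` can then be splatted into the tool call, for example `super_agent_run(**request.to_kwargs())`.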

Example: Marketing Generates Copy Variants

Say our Marketing agent needs three headline variants for an email campaign. Instead of doing it inline (blocking its own context), it dispatches to the pool:

result = super_agent_run(
    task="copy-variants",
    prompt_text="""Generate 3 email subject line variants for our 
    shared pool launch announcement. Target audience: technical 
    founders. Tone: direct, no hype. Return as a JSON array.""",
    adapter="claude",
    cwd_mode="isolated",
    caller_cwd="/home/appuser/workspace",
    model_hint="claude-sonnet-4-20250514"
)

The Marketing agent keeps working. The pool handles the generation, returns the result, and the run dies. Clean.

Example: CTO Runs a Dependency Audit

result = super_agent_run(
    task="dep-audit",
    prompt_file="/home/appuser/workspace/prompts/audit-deps.md",
    adapter="claude",
    cwd_mode="caller",
    caller_cwd="/home/appuser/workspace/backend",
    model_hint="claude-sonnet-4-20250514"
)

Here we use cwd_mode="caller" so the run can actually read the project's package.json and lock files. The path sanitizer ensures the run can't escape the declared caller_cwd.

Example: DevOps Validates a Config

result = super_agent_run(
    task="config-check",
    prompt_text="Validate this Kubernetes manifest for security issues: ...",
    adapter="gemini",
    cwd_mode="isolated",
    caller_cwd="/home/appuser/workspace",
    model_hint="gemini-2.0-flash-lite"
)

Notice the adapter switch? That brings us to cost optimization.

Cost Optimization: Use Cheap Models for Cheap Work

Not every task needs a frontier model. Config validation, log parsing, basic summarization — these are commodity tasks. The shared pool lets you route them to cheaper models with a single parameter change.

Set adapter="gemini" and model_hint="gemini-2.0-flash-lite" for non-critical work. We use this for:

Content summarization — extracting key points from long docs
Log analysis — pattern-matching in DevOps logs
Research extraction — pulling structured data from unstructured text
Config validation — checking YAML/JSON against known schemas

The savings add up fast. A Gemini Flash Lite call costs a fraction of a Claude Opus call. When you're running dozens of these a day, you're looking at meaningful reductions in your monthly inference bill.

Reserve the heavier models for work that actually needs them: nuanced code review, complex competitive analysis, anything requiring deep reasoning.
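One way to keep this routing decision out of individual agents is a small lookup table. The sketch below is illustrative only: the category names and the `pick_model` helper are hypothetical, and the model names are simply the ones used in the examples above.

```python
# Hypothetical routing table: task category -> (adapter, model_hint).
# Model names are the ones used in this post; the categories are ours to define.
CHEAP = ("gemini", "gemini-2.0-flash-lite")
FRONTIER = ("claude", "claude-sonnet-4-20250514")

ROUTES = {
    "summarize": CHEAP,
    "log-analysis": CHEAP,
    "extract": CHEAP,
    "config-check": CHEAP,
    "code-review": FRONTIER,
    "competitive-analysis": FRONTIER,
}


def pick_model(category: str) -> dict:
    """Return adapter/model_hint kwargs; unknown categories go to the frontier model."""
    adapter, model_hint = ROUTES.get(category, FRONTIER)
    return {"adapter": adapter, "model_hint": model_hint}
```

Defaulting unknown categories to the heavier model trades a little cost for safety: a task nobody classified is more likely to be unusual than routine.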

Gotchas You Need to Know

We've been running this in production. Here's what will bite you if you're not ready.

1. POOL_BUSY — Back Off and Retry

When every available slot is occupied (9 at full scale), the RunRegistry returns POOL_BUSY. This is not an error; it's flow control.

Do: Implement exponential backoff. Wait a few seconds, retry, and double the wait after each busy response. Most runs finish quickly.

Don't: Hammer the pool in a tight loop. You'll just waste cycles and annoy the scheduler.

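# Good pattern: bounded retries, doubling the wait each time
# (runs inside an async function; needs `import asyncio`)
backoff_seconds = 2
for attempt in range(5):
    result = super_agent_run(...)
    if result.status != "POOL_BUSY":
        break
    await asyncio.sleep(backoff_seconds)
    backoff_seconds *= 2

# Bad pattern: tight loop with no delay
while result.status == "POOL_BUSY":
    result = super_agent_run(...)  # don't do this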
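For a reusable version of the retry idea, the sketch below wraps any dispatch callable in capped exponential backoff with jitter. Everything in it is an assumption for illustration: the `RunResult` shape, the `"OK"` success status used in testing, and the default delays are not taken from the pool's actual API.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """Hypothetical result shape; only the `status` field is assumed here."""
    status: str
    output: str = ""


def run_with_backoff(dispatch: Callable[[], RunResult],
                     max_attempts: int = 5,
                     base_delay: float = 2.0,
                     max_delay: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> RunResult:
    """Call dispatch() until it stops returning POOL_BUSY or attempts run out."""
    result = dispatch()
    for attempt in range(1, max_attempts):
        if result.status != "POOL_BUSY":
            return result
        # Full jitter: sleep a random amount up to the capped exponential delay,
        # so several busy callers do not all retry at the same instant.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
        result = dispatch()
    return result  # may still be POOL_BUSY; the caller decides what to do next
```

Passing `sleep` as a parameter makes the helper easy to exercise in tests without real waiting. A caller would typically pass something like `lambda: super_agent_run(**kwargs)` as `dispatch`.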

2. POOL_UNREACHABLE — Escalate, Don't Retry

If the pool itself is down (network issue, pod crash, namespace problem), you'll get POOL_UNREACHABLE. This is different from busy.

Do: Escalate to your monitoring system or fall back to inline execution.

Don't: Retry blindly. If the pool is unreachable, retrying won't fix a networking or infrastructure problem.
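The escalate-or-fall-back rule can be captured in a few lines. This is a sketch under assumptions: `RunResult`, `dispatch_or_fallback`, and the `alert` hook are hypothetical names, and how you actually run a task inline depends on your agent.

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("super_agent_client")


@dataclass
class RunResult:
    """Hypothetical result shape; only the `status` field is assumed here."""
    status: str
    output: str = ""


def dispatch_or_fallback(dispatch: Callable[[], RunResult],
                         run_inline: Callable[[], RunResult],
                         alert: Callable[[str], None] = log.error) -> RunResult:
    """Try the pool once; on POOL_UNREACHABLE, alert and run the task inline."""
    result = dispatch()
    if result.status == "POOL_UNREACHABLE":
        # No retry here: an unreachable pool is an infrastructure problem,
        # so surface it and keep the calling agent moving.
        alert("super-agent pool unreachable; falling back to inline execution")
        return run_inline()
    return result
```

Keeping the alert separate from the fallback means the outage is still visible to whoever is on call even when the task itself succeeds inline.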

3. No State Carries Forward

Every run starts clean and dies clean. There's no shared memory between runs, no conversation history, no persistent context.

If you need results from run A to feed into run B, your calling agent is responsible for passing that data explicitly. The pool is stateless by design — it's what keeps it simple and reliable.
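Here is what passing data explicitly can look like in practice. The sketch assumes a hypothetical `run` callable that wraps super_agent_run with the adapter and cwd arguments already filled in, and a hypothetical `RunResult` with an `output` field. The task labels are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """Hypothetical result shape with the generated text in `output`."""
    status: str
    output: str = ""


def summarize_then_title(document: str, run: Callable[..., RunResult]) -> str:
    """Two stateless runs; the caller carries run A's output into run B's prompt."""
    summary = run(task="summarize-doc",
                  prompt_text=f"Summarize in 3 bullet points:\n\n{document}")
    # Run B has no memory of run A, so the summary is embedded in its prompt.
    title = run(task="title-from-summary",
                prompt_text=f"Write a one-line title for this summary:\n\n{summary.output}")
    return title.output
```

The calling agent is the only place where the two runs meet, which is exactly the property that keeps the pool itself stateless.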

4. Path Sanitization Is Strict

When using cwd_mode="caller", the A2A server's path sanitizer will reject any path that tries to escape the declared caller_cwd. This is a security boundary, not a bug. If your run needs files outside its declared directory, restructure your approach rather than trying to work around the sanitizer.
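To show the kind of check a sanitizer like this performs, here is a minimal sketch using the standard library. It is not the pool's actual sanitizer: the function name and the choice of `PermissionError` are assumptions made for illustration.

```python
from pathlib import Path


def resolve_inside(caller_cwd: str, requested: str) -> Path:
    """Resolve `requested` against caller_cwd and reject anything outside it."""
    root = Path(caller_cwd).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check,
    # so "a/../../etc" and symlink tricks are judged by where they really land.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"{requested!r} escapes {root}")
    return candidate
```

Comparing resolved paths rather than raw strings matters: a prefix test on the unresolved string would accept `/home/appuser/workspace/../secrets` even though it points outside the workspace.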

What's Next

We've got 43 unit tests and a full integration test covering the pool, the registry, the MCP proxy, and the path sanitizer. The system has been running stably in our fleet, handling specialist dispatch for all six role agents.

The shared pool pattern isn't specific to our setup. If you're running multiple AI agents that occasionally need to offload work, this architecture — slot-managed pool, MCP tool proxy, stateless runs — scales well and keeps costs predictable.

Build your own cyborgenic organization at agent.ceo. The fleet is live, the agents are working, and we ship what we learn every week. Come see how the pieces fit together.
