How We Cut Agent Compute Costs with a Shared Pool (And How You Can Too)

Moshe Beeri, Founder

TL;DR

  • A shared pool of auto-scaling pods handles one-shot specialist tasks for every role agent — replacing per-agent subprocess sprawl.
  • Route cheap work to cheap models (gemini-2.0-flash-lite) and save frontier-model budgets for reasoning-heavy tasks.
  • This is how a cyborgenic organization manages compute: pooled resources, stateless runs, predictable costs.

Eleven AI agents. Fifty specialist tasks a day. Every one of them spinning up its own pod, its own context, its own bill -- then sitting idle 90% of the time. That is not an architecture. That is a cloud provider's dream and your CFO's nightmare.

In a Cyborgenic Organization, compute economics directly constrain how many agents you can run and what work they can take on. Shared resource pooling is what turns an expensive experiment with a handful of agents into a sustainable operating model that scales with the organization.

I run 11 agents in production -- CEO, CTO, DevOps, Fullstack, Marketing, Architect, CFO, CSO, Investment, Org-Agent, and ZiDevops-Director. When I looked at our GKE bill and saw pods burning compute while agents waited for work, I knew we needed pooling. You do not give every employee their own building. So I built the super-agent shared pool: a managed set of runners that any role agent can dispatch work to, on demand, at a fraction of the cost. We have pushed 646 commits this month alone wiring this into the fleet.

What We Built

The shared pool is exactly what it sounds like: a small, auto-scaling set of pods that any role agent can dispatch one-shot work to. Think of it as a bullpen of generalist runners waiting for assignments.

Here's the architecture at a glance:

  • 1 pod minimum, Horizontal Pod Autoscaler scales up to 3 pods
  • Each pod handles up to 3 concurrent runs — that's 9 concurrent runs fleet-wide
  • A RunRegistry tracks slot availability across all pods
  • An MCP tool proxy exposes a single super_agent_run function that any agent can call
  • Every run dies when it's done — no state carries forward between runs

The key insight: most specialist tasks are short-lived and stateless. They don't need a persistent pod. They need 30 seconds of compute, a result, and a clean exit.

The Components

If you're following along in our repo, here's where things live:

Component | Path
Pool runtime + RunRegistry | packages/super-agent/
K8s manifests (Deployment, HPA, NetworkPolicy) | deploy/gke/manifests/super-agent-pool.yaml
MCP tool proxy | conductor/src/mcp_servers/super_agent_mcp.py
Skill definition | deploy/gke/configs/skills/super-agent-skill.md

The RunRegistry is the brains of slot management. It knows how many slots are free, which pods are running what, and whether to accept or reject a new request. The MCP proxy is the front door -- it is the only thing your agents actually talk to.
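
To make the slot bookkeeping concrete, here is a minimal in-memory sketch of what a registry like this might track. It is not our production RunRegistry: the class name SlotRegistry, its method names, and the single-process lock are all assumptions made for illustration. Only the 3-slots-per-pod figure comes from the setup described above.

```python
import threading
from typing import Optional

SLOTS_PER_POD = 3  # from the article: each pod handles up to 3 concurrent runs


class SlotRegistry:
    """Hypothetical in-memory stand-in for a slot registry (not the real RunRegistry)."""

    def __init__(self, pods: list[str]) -> None:
        self._runs: dict[str, set[str]] = {pod: set() for pod in pods}
        self._lock = threading.Lock()

    def free_slots(self) -> int:
        """Count unused slots across every pod."""
        with self._lock:
            return sum(SLOTS_PER_POD - len(runs) for runs in self._runs.values())

    def acquire(self, run_id: str) -> Optional[str]:
        """Reserve a slot on the least-loaded pod; return None when every slot is taken."""
        with self._lock:
            if not self._runs:
                return None
            pod = min(self._runs, key=lambda p: len(self._runs[p]))
            if len(self._runs[pod]) >= SLOTS_PER_POD:
                return None  # a caller would surface this as POOL_BUSY
            self._runs[pod].add(run_id)
            return pod

    def release(self, pod: str, run_id: str) -> None:
        """Free the slot when the run exits."""
        with self._lock:
            self._runs[pod].discard(run_id)
```

A real registry that spans pods would need shared storage rather than a process-local dict; the sketch only shows the accept-or-reject decision.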

Before an agent dispatches to the pool, it first checks whether to delegate at all. This is the real delegation decision function from our 190-function agent_hub_mcp.py:

# From conductor/src/mcp_servers/agent_hub_mcp.py — real production code
@mcp.tool()
async def should_i_delegate(task_priority: str = "normal") -> dict:
    """
    Decide whether to do a task yourself or delegate it.

    Efficiency principle:
    1. Do it yourself if you have capacity
    2. Delegate to a report with capacity
    3. Delegate to a sibling with capacity
    4. Spawn a worker as last resort
    """
    agent_id = os.environ.get("ROLE_ID", "agent")
    tracker = get_capacity_tracker()
    decision = tracker.should_delegate(agent_id, task_priority)
    return {"success": True, "agent_id": agent_id, **decision}

That four-step priority list is crucial. I wasted weeks debugging weird delegation loops before I settled on this hierarchy. An agent should only offload work when it genuinely lacks capacity -- otherwise you just add latency for no reason.
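
The capacity tracker itself is not shown above, so here is a rough sketch of how the four-step order could be evaluated. The Agent record, the decide function, and the active/limit counters are hypothetical names I made up for this example, and the sketch ignores task_priority entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical record: tasks an agent is running vs. how many it can hold."""
    active: int
    limit: int
    reports: list[str] = field(default_factory=list)
    siblings: list[str] = field(default_factory=list)

    def has_capacity(self) -> bool:
        return self.active < self.limit


def decide(agent_id: str, fleet: dict[str, Agent]) -> dict:
    """Walk the four steps in order and stop at the first one with capacity."""
    me = fleet[agent_id]
    if me.has_capacity():
        return {"action": "do_it_yourself", "target": agent_id}
    # Reports are checked before siblings, matching steps 2 and 3.
    for relation, names in (("report", me.reports), ("sibling", me.siblings)):
        for name in names:
            if fleet[name].has_capacity():
                return {"action": f"delegate_to_{relation}", "target": name}
    return {"action": "spawn_worker", "target": None}  # step 4: last resort
```

Because the self-check comes first, an agent with spare capacity never delegates, which is what keeps work from bouncing between agents.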

How to Use It

Every role agent interacts with the pool through a single MCP tool call: super_agent_run. Here's what a call looks like:

super_agent_run(
    task="summarize-doc",
    prompt_text="Summarize the following document in 3 bullet points: ...",
    adapter="claude",
    cwd_mode="isolated",
    caller_cwd="/home/appuser/workspace",
    model_hint="claude-sonnet-4-20250514"
)

Let's break down each parameter:

  • task — A short label for the run. Used for logging and slot tracking.
  • prompt_text — The actual prompt. Alternatively, use prompt_file to point to a file containing the prompt.
  • adapter — Which model provider to use ("claude", "gemini", etc.).
  • cwd_mode — "isolated" gives the run its own workspace. "caller" shares the caller's directory (use carefully).
  • caller_cwd — The calling agent's working directory. The pool's path sanitizer validates this to prevent directory escapes.
  • model_hint — Suggest a specific model. The pool will use it if available.
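
If you want to catch bad arguments before they reach the pool, a small client-side check can help. The sketch below is an assumption-heavy illustration: RunRequest is a hypothetical class, and the rules it enforces (exactly one of prompt_text or prompt_file, cwd_mode limited to the two values listed above, model_hint optional) are my reading of the parameter list, not a description of what the proxy actually validates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRequest:
    """Hypothetical client-side mirror of the super_agent_run arguments."""
    task: str
    adapter: str
    caller_cwd: str
    cwd_mode: str = "isolated"
    prompt_text: Optional[str] = None
    prompt_file: Optional[str] = None
    model_hint: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed rule: one prompt source, either inline text or a file path.
        if (self.prompt_text is None) == (self.prompt_file is None):
            raise ValueError("pass exactly one of prompt_text or prompt_file")
        if self.cwd_mode not in ("isolated", "caller"):
            raise ValueError(f"unknown cwd_mode: {self.cwd_mode!r}")
        if not self.task.strip():
            raise ValueError("task label must not be empty")
```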

Example: Marketing Generates Copy Variants

Say our Marketing agent needs three headline variants for an email campaign. Instead of doing it inline (blocking its own context), it dispatches to the pool:

result = super_agent_run(
    task="copy-variants",
    prompt_text="""Generate 3 email subject line variants for our 
    shared pool launch announcement. Target audience: technical 
    founders. Tone: direct, no hype. Return as a JSON array.""",
    adapter="claude",
    cwd_mode="isolated",
    caller_cwd="/home/appuser/workspace",
    model_hint="claude-sonnet-4-20250514"
)

The Marketing agent keeps working. The pool handles the generation, returns the result, and the run dies. Clean.
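
What "keeps working" can look like in code depends on how your agent runtime calls tools. As one possible shape, assuming an async client, the sketch below starts the run as a background task and collects the result later. fake_super_agent_run is a stand-in I wrote for the example, not the real tool, and the canned JSON array is placeholder data.

```python
import asyncio
import json


async def fake_super_agent_run(**kwargs) -> str:
    """Stand-in for the real tool call; returns a canned JSON array after a short delay."""
    await asyncio.sleep(0.05)
    return json.dumps(["Variant A", "Variant B", "Variant C"])


async def marketing_turn() -> tuple[list[str], list[str]]:
    # Start the pool run without awaiting it, so the agent is not blocked.
    pending = asyncio.create_task(
        fake_super_agent_run(task="copy-variants", adapter="claude")
    )
    done_meanwhile = []
    for step in ("draft body", "pick send time"):
        await asyncio.sleep(0)  # placeholder for the agent's own work
        done_meanwhile.append(step)
    variants = json.loads(await pending)  # collect the result when it is needed
    return done_meanwhile, variants
```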

Example: CTO Runs a Dependency Audit

result = super_agent_run(
    task="dep-audit",
    prompt_file="/home/appuser/workspace/prompts/audit-deps.md",
    adapter="claude",
    cwd_mode="caller",
    caller_cwd="/home/appuser/workspace/backend",
    model_hint="claude-sonnet-4-20250514"
)

Here we use cwd_mode="caller" so the run can actually read the project's package.json and lock files. The path sanitizer ensures the run can't escape the declared caller_cwd.

Example: DevOps Validates a Config

result = super_agent_run(
    task="config-check",
    prompt_text="Validate this Kubernetes manifest for security issues: ...",
    adapter="gemini",
    cwd_mode="isolated",
    caller_cwd="/home/appuser/workspace",
    model_hint="gemini-2.0-flash-lite"
)

Notice the adapter switch? That brings us to cost optimization.

Cost Optimization: Use Cheap Models for Cheap Work

Not every task needs a frontier model. Config validation, log parsing, basic summarization -- these are commodity tasks. The shared pool lets you route them to cheaper models with a single parameter change.

Here is the actual infrastructure backbone from our docker-compose.yaml -- the services every pool worker depends on:

# From deploy/docker/docker-compose.yaml — real production config
services:
  nats:
    image: nats:2.10-alpine
    command: ["--js", "--sd", "/data", "-m", "8222"]
    ports:
      - "4222:4222"              # Client connections
      - "127.0.0.1:8222:8222"   # HTTP monitoring (localhost only)
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8222/healthz"]

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

  gateway:
    ports:
      - "8000:8000"
    environment:
      AGENT_REGISTRY_URL: http://agent-registry:8002
      FLOW_ENGINE_URL: http://flow-engine:8003
      REDIS_URL: redis://redis:6379
      RATE_LIMIT_RPM: "1000"

NATS JetStream on 4222 for event pub/sub, Redis 7 on 6379 for caching, and the Gateway on 8000 routing to the Agent Registry on 8002 and the Flow Engine on 8003. Every pool worker connects to this same backbone. When I say "shared infrastructure," I mean it literally -- the same NATS subjects, the same Redis instance, the same service mesh.

Set adapter="gemini" and model_hint="gemini-2.0-flash-lite" for non-critical work. We use this for:

  • Content summarization — extracting key points from long docs
  • Log analysis — pattern-matching in DevOps logs
  • Research extraction — pulling structured data from unstructured text
  • Config validation — checking YAML/JSON against known schemas

The savings add up fast. A Gemini Flash Lite call costs a fraction of a Claude Opus call. When you're running dozens of these a day, you're looking at meaningful reductions in your monthly inference bill.

Reserve the heavier models for work that actually needs them: nuanced code review, complex competitive analysis, anything requiring deep reasoning.
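
One way to keep this routing decision out of individual prompts is a small lookup table. The sketch below is illustrative only: the category names and the route helper are hypothetical, and the two model identifiers are simply the ones used in the examples in this post.

```python
# Hypothetical routing table: task category -> (adapter, model_hint).
CHEAP = ("gemini", "gemini-2.0-flash-lite")
FRONTIER = ("claude", "claude-sonnet-4-20250514")

ROUTES = {
    "summarize": CHEAP,
    "log-analysis": CHEAP,
    "extract": CHEAP,
    "config-check": CHEAP,
    "code-review": FRONTIER,
    "competitive-analysis": FRONTIER,
}


def route(category: str) -> dict:
    """Return adapter/model_hint kwargs; unknown categories fall back to the heavier model."""
    adapter, model_hint = ROUTES.get(category, FRONTIER)
    return {"adapter": adapter, "model_hint": model_hint}
```

Defaulting unknown categories to the heavier model trades cost for safety. Flip the default if you would rather opt in to expensive runs explicitly.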

Gotchas You Need to Know

We've been running this in production. Here's what will bite you if you're not ready.

1. POOL_BUSY — Back Off and Retry

When all 9 slots are occupied, the RunRegistry returns POOL_BUSY. This is not an error — it's flow control.

Do: Implement exponential backoff. Wait a few seconds and retry. Most runs finish quickly.

Don't: Hammer the pool in a tight loop. You'll just waste cycles and annoy the scheduler.

# Good pattern
if result.status == "POOL_BUSY":
    await asyncio.sleep(backoff_seconds)
    # retry with increasing backoff

# Bad pattern
while result.status == "POOL_BUSY":
    result = super_agent_run(...)  # don't do this
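
For a fuller version of the good pattern, here is a sketch of exponential backoff with jitter. It assumes the dispatch call is async and returns an object with a status attribute, as in the snippet above. The run_with_backoff helper, its parameter names, and the default limits are my own choices for the example, not values from our codebase.

```python
import asyncio
import random


async def run_with_backoff(dispatch, *, attempts=5, base=1.0, cap=30.0,
                           sleep=asyncio.sleep):
    """Call dispatch() until it stops returning POOL_BUSY or attempts run out."""
    result = await dispatch()
    for attempt in range(1, attempts):
        if result.status != "POOL_BUSY":
            break
        # Double the delay each round (1s, 2s, 4s, ...) up to the cap.
        delay = min(cap, base * 2 ** (attempt - 1))
        # Add up to 50% random jitter so callers do not retry in lockstep.
        await sleep(delay + random.uniform(0, delay / 2))
        result = await dispatch()
    return result
```

If the last attempt is still busy, the caller gets the POOL_BUSY result back and can decide whether to queue the work or do it inline.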

2. POOL_UNREACHABLE — Escalate, Don't Retry

If the pool itself is down (network issue, pod crash, namespace problem), you'll get POOL_UNREACHABLE. This is different from busy.

Do: Escalate to your monitoring system or fall back to inline execution.

Don't: Retry blindly. If the pool is unreachable, retrying won't fix a networking or infrastructure problem.
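
Here is a sketch of treating the two statuses differently. It assumes the result object exposes status and output attributes (output is a hypothetical field name), and run_inline stands for whatever local fallback your agent has.

```python
import logging

logger = logging.getLogger("pool-dispatch")


def handle_result(result, run_inline):
    """Route a pool response: use it, fall back inline, or signal a later retry."""
    if result.status == "POOL_UNREACHABLE":
        # Infrastructure problem: log for alerting and do the work locally, no retry.
        logger.error("super-agent pool unreachable; falling back to inline execution")
        return run_inline()
    if result.status == "POOL_BUSY":
        return None  # flow control: the caller should back off and retry later
    return result.output
```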

3. No State Carries Forward

Every run starts clean and dies clean. There's no shared memory between runs, no conversation history, no persistent context.

If you need results from run A to feed into run B, your calling agent is responsible for passing that data explicitly. The pool is stateless by design — it's what keeps it simple and reliable.
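
In practice that means pasting run A's output into run B's prompt. A minimal sketch, assuming a run callable that takes task and prompt_text and returns the run's text output (a simplification of the real tool call, and the task labels are made up):

```python
def chain(run, document: str) -> str:
    """Run A extracts facts; run B receives them in its prompt, since the pool shares nothing."""
    facts = run(
        task="extract-facts",
        prompt_text=f"List the key facts in:\n{document}",
    )
    # The only link between the two runs is this string, carried by the caller.
    return run(
        task="draft-summary",
        prompt_text=f"Write a two-sentence summary from these facts only:\n{facts}",
    )
```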

4. Path Sanitization Is Strict

When using cwd_mode="caller", the A2A server's path sanitizer will reject any path that tries to escape the declared caller_cwd. This is a security boundary, not a bug. If your run needs files outside its declared directory, restructure your approach rather than trying to work around the sanitizer.
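
Our sanitizer's code is not shown in this post, but the core check for this kind of boundary is usually small. The sketch below is a generic containment check using pathlib, not our implementation. resolve_inside is a hypothetical name, and raising PermissionError is my choice for the example.

```python
from pathlib import Path


def resolve_inside(caller_cwd: str, requested: str) -> Path:
    """Return the resolved path, or raise if it lands outside caller_cwd."""
    root = Path(caller_cwd).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    target = (root / requested).resolve()
    if target != root and root not in target.parents:
        raise PermissionError(f"{requested!r} escapes {caller_cwd!r}")
    return target
```

Resolving before comparing is the important part: a plain string prefix check would let paths like backend/../../secrets through.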

What's Next

We have 43 unit tests and a full integration test covering the pool, the registry, the MCP proxy, and the path sanitizer -- part of our 83,163 test functions across 2,304 test files. The system has been running stably in our fleet, handling specialist dispatch for all 11 role agents.

The delegation recommendation function shows the other side of this -- it looks for a teammate with bandwidth first, and only tells the agent to spawn a worker when everyone is at capacity:

# From conductor/src/mcp_servers/agent_hub_mcp.py — real production code
@mcp.tool()
async def get_delegation_recommendation(task_priority: str = "normal") -> dict:
    """Get recommendation for who to delegate a task to.
    Checks reports first (they work for you), then siblings.
    Returns None if everyone is full — meaning spawn a worker."""
    agent_id = os.environ.get("ROLE_ID", "agent")
    tracker = get_capacity_tracker()
    target = tracker.get_delegation_target(agent_id, task_priority)

    if target:
        target_cap = tracker.get_capacity(target)
        return {
            "delegate_to": target,
            "target_capacity": target_cap,
            "message": f"Delegate to {target} - they have bandwidth"
        }
    return {
        "delegate_to": None,
        "spawn_worker": True,
        "message": "All team members at capacity. Spawn a worker for this task."
    }

Reports first, then siblings, then a pool worker. That hierarchy took me three iterations to get right, but it keeps the fleet from thrashing.

The shared pool pattern is not specific to our setup. If you are running multiple AI agents that occasionally need to offload work, this architecture -- slot-managed pool, MCP tool proxy, stateless runs -- scales well and keeps costs predictable.


I'm Moshe Beeri. I build agent.ceo -- a cyborgenic organization where 11 AI agents and humans ship software together. 9,799 commits and counting.
