Resilient Agent Task Delivery: Pull-Based Discovery and Role-Based Tool Filtering
TL;DR
- Pull-based task discovery eliminates lost work from pod restarts, NATS message loss, and node evictions — agents reconstruct their workload from a durable registry on startup.
- Role-based MCP tool filtering cuts tool-definition context overhead by roughly 20%, giving agents more reasoning headroom and fewer tool-selection errors.
- Both patterns follow the same principle at the heart of any cyborgenic organization: reduce what the agent has to deal with, and make the rest crash-proof.
Run AI agents long enough and you will watch a task vanish. Not fail — vanish. I watched this happen on a Saturday afternoon: the Marketing agent was supposed to receive a content assignment. Its pod restarted between assignment and delivery. The NATS message hit an empty socket. The task registry still said "assigned." The agent came back up with a clean slate and no idea it owed anyone anything. I didn't find out until I checked the dashboard three hours later.
In a Cyborgenic Organization, agents must be as reliable as the infrastructure they run on. When humans and AI share the same org chart and the same accountability systems, a vanishing task is not a minor inconvenience — it is an organizational failure. Resilience is not a nice-to-have feature; it is a structural requirement of the Cyborgenic model. If your agents cannot survive a pod restart without losing work, you do not have an organization. You have a demo.
A Cyborgenic Organization cannot tolerate that. This post covers two problems I solved at agent.ceo while running a fleet of 11 Claude Code agents in Kubernetes: making task delivery survive infrastructure failures, and keeping context windows small enough that agents can actually think.
Part 1: Pull-Based Task Discovery
The failure mode
My agents communicate over NATS on port 4222, the same NATS JetStream instance that handles all inter-agent messaging. NATS is excellent at what it does: lightweight publish/subscribe messaging with sub-millisecond latency. But plain publish/subscribe has a property that becomes a liability in agent systems: delivery is at-most-once, so a message published while its subscriber is offline is lost.
That is a fine trade-off for microservices where a load balancer routes around unhealthy instances. It is not fine when your "subscriber" is a single AI agent with a unique role, and the "message" is a task assignment that took me 10 minutes to write.
Here is the timeline of the real failure I hit:
1. I assign a task to the Marketing agent via TMS.
2. TMS publishes an assignment notification over NATS.
3. The Marketing agent's pod is restarting: node eviction, OOM-kill, wrapper restart, whatever.
4. NATS delivers the message to nobody.
5. The Marketing agent comes back online with no knowledge of the task.
6. The task sits in "assigned" status until the SLA enforcement system catches it, pings the agent, and eventually reassigns.
Before I tightened SLA enforcement, step 6 could take hours. After tightening it to 25 minutes, it was faster — but 25 minutes of dead time on every pod restart is still unacceptable. And it only addresses the symptom. The root cause is coupling task delivery to message delivery.
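To make the coupling concrete, here is a toy, in-memory simulation of the timeline above. It is not NATS client code: `FireAndForgetBus` and `Registry` are hypothetical stand-ins that only model at-most-once delivery and durable task state.

```python
from dataclasses import dataclass, field


@dataclass
class FireAndForgetBus:
    """Toy stand-in for at-most-once pub/sub: no subscriber, no delivery."""
    subscribers: dict = field(default_factory=dict)  # role -> list used as an inbox

    def publish(self, role: str, message: str) -> bool:
        inbox = self.subscribers.get(role)
        if inbox is None:
            return False  # nobody is listening, so the message is dropped
        inbox.append(message)
        return True


@dataclass
class Registry:
    """Toy stand-in for the durable task registry."""
    tasks: dict = field(default_factory=dict)  # task_id -> {"assignee", "status"}

    def assign(self, task_id: str, role: str) -> None:
        self.tasks[task_id] = {"assignee": role, "status": "assigned"}

    def pending_for(self, role: str) -> list:
        return sorted(
            task_id for task_id, task in self.tasks.items()
            if task["assignee"] == role and task["status"] == "assigned"
        )


bus, registry = FireAndForgetBus(), Registry()

# Steps 1-4: the task is recorded, but the agent's pod is down when the
# notification goes out, so the push is lost.
registry.assign("task-001", "marketing")
delivered = bus.publish("marketing", "task-001 assigned")

# Step 5: the agent restarts with an empty inbox...
marketing_inbox: list = []
bus.subscribers["marketing"] = marketing_inbox

# ...and only a pull from the registry recovers the work.
recovered = registry.pending_for("marketing")
print(delivered, marketing_inbox, recovered)  # prints: False [] ['task-001']
```

The push path has no memory of the task once the publish returns. The registry does, which is why the fix moves the source of truth there.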
The fix: tasks live in the registry, not in queues
The solution is pull-based discovery. Instead of relying on NATS to push task notifications to agents, agents pull their assigned tasks from a shared task registry on startup and at regular intervals during operation.
The registry is a set of task-*.json files in a shared persistent volume. Each file contains the full task state: ID, assignee, status, description, priority, timestamps. The registry is the source of truth. NATS notifications are an optimization — they tell agents "something changed, go look" — but they are not required for correctness.
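As a rough illustration of what reading and writing such a registry could look like, here is a minimal sketch. The field names follow the list above, but the exact schema, the status names in `ACTIVE_STATUSES`, and the write-to-temp-then-rename step are assumptions for the example, not a description of the production code.

```python
import json
import os
import tempfile
from pathlib import Path

ACTIVE_STATUSES = {"assigned", "accepted", "in_progress"}  # assumed status names


def write_task(registry: Path, task: dict) -> Path:
    """Write task-<id>.json atomically so readers never see a partial file."""
    target = registry / f"task-{task['id']}.json"
    fd, tmp_name = tempfile.mkstemp(dir=registry, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(task, tmp)
    os.replace(tmp_name, target)  # atomic rename within one filesystem
    return target


def active_tasks_for(registry: Path, role_id: str) -> list[dict]:
    """Scan the registry and return this role's unfinished tasks."""
    found = []
    for path in sorted(registry.glob("task-*.json")):
        try:
            task = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable files instead of failing the whole scan
        if task.get("assignee") == role_id and task.get("status") in ACTIVE_STATUSES:
            found.append(task)
    return found


with tempfile.TemporaryDirectory() as tmp_dir:
    registry = Path(tmp_dir)
    write_task(registry, {"id": "001", "assignee": "marketing", "status": "assigned",
                          "description": "Draft launch post", "priority": "high"})
    write_task(registry, {"id": "002", "assignee": "marketing", "status": "completed",
                          "description": "Old task", "priority": "low"})
    write_task(registry, {"id": "003", "assignee": "fullstack", "status": "assigned",
                          "description": "Fix login", "priority": "high"})
    marketing_ids = [t["id"] for t in active_tasks_for(registry, "marketing")]
    print(marketing_ids)  # prints: ['001']
```

Because the reader derives everything from the files, it gives the same answer whether or not any notification was ever delivered.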
I wired pull-based discovery into two places: session startup and the wrapper wait loop.
Session startup: task sync from inbox
When an agent session starts, inbox_listener.py runs _startup_sync_inbox_tasks(). This function reads every task assignment from the agent's inbox, syncs them to the local TaskStore, and auto-accepts them. Here's the real code — this is what runs in production across all 11 agent roles:
```python
# From conductor/src/mcp_servers/inbox_listener.py
def _startup_sync_inbox_tasks(self) -> None:
    """Scan existing task inbox files and sync missing ones to the local TaskStore.

    Handles tasks that arrived before the sync code was deployed or were
    acknowledged by a previous inbox_listener instance that didn't have
    the sync logic. Idempotent -- skips tasks already in the store.
    """
    tasks_dir = self.agent_inbox / "pending" / "tasks"
    if not tasks_dir.exists():
        return

    synced = 0
    skipped = 0
    for task_file in tasks_dir.glob("*.json"):
        try:
            data = json.loads(task_file.read_text())
            payload = data.get("payload", data)
            if not isinstance(payload, dict):
                continue
            if payload.get("type") != "task_assignment":
                continue
            task_id = payload.get("id", "")
            if not task_id:
                continue

            from mcp_servers.task_store import get_task_store
            task_store = get_task_store()
            if task_store.get_task(task_id):
                skipped += 1
                continue

            self._sync_task_to_local_store(data)
            synced += 1
        except Exception as e:
            logger.warning("Startup sync: failed to process %s: %s", task_file.name, e)

    if synced > 0:
        logger.info("Startup sync: synced %d tasks to local TaskStore (%d already present)", synced, skipped)
```
The logic is deliberately simple. Scan the inbox directory. Filter by type. Sync to the local store. Skip duplicates. No caching, no indexing, no database. The inbox is small enough — tens of tasks, not thousands — that a directory scan on startup is effectively free.
This means that regardless of what happened during the agent's downtime — NATS messages lost, pod killed mid-task, node drained — the agent reconstructs its full workload from the inbox the moment it starts. No task vanishes because of infrastructure churn.
Wrapper wait loop: task-driven wake
The second integration point is claude_wrapper.sh. Between sessions, the wrapper enters a wait loop. In the old design, it slept for up to an hour before checking for new work. That was too slow.
The new loop checks the TMS every 30 seconds. If assigned tasks exist, the wrapper wakes immediately and starts a new session:
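```bash
POLL_INTERVAL=30
MAX_WAIT=3600
elapsed=0

while [ "$elapsed" -lt "$MAX_WAIT" ]; do
  # Check TMS for tasks assigned to this role that are still open.
  # The patterns assume compact JSON (no space after the colon).
  # The status alternation is grouped so it only matches the status field,
  # and xargs -r skips the second grep when no files matched the first.
  task_count=$(find "$TASK_REGISTRY_PATH" -name "task-*.json" \
    -exec grep -l "\"assignee\":\"$ROLE_ID\"" {} \; | \
    xargs -r grep -lE '"status":"(assigned|accepted)"' 2>/dev/null | wc -l)

  if [ "$task_count" -gt 0 ]; then
    echo "[wrapper] Found $task_count assigned tasks — starting session"
    break
  fi

  sleep "$POLL_INTERVAL"
  elapsed=$((elapsed + POLL_INTERVAL))
done
```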
Thirty seconds is the maximum latency between a task appearing in the registry and the agent noticing it. In practice, NATS notifications still arrive and trigger immediate wakes for agents that are already online. The polling is the fallback — it catches anything that NATS missed.
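The relationship between the two wake paths, notification as the fast path and polling as the safety net, can be sketched in Python. This is an illustration only, not a port of claude_wrapper.sh; `wait_for_work` and its parameters are hypothetical names, and the demo uses intervals of a few milliseconds so it finishes quickly.

```python
import threading
import time
from typing import Callable


def wait_for_work(has_work: Callable[[], bool], wake: threading.Event,
                  poll_interval: float = 30.0, max_wait: float = 3600.0) -> str:
    """Block until the registry shows work, returning what ended the wait.

    `wake` is set by a notification handler (the fast path). If it is never
    set, the loop still re-checks the registry every `poll_interval` seconds.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if has_work():
            return "work_found"
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "timeout"
        # Returns early if a notification arrives, otherwise after the interval.
        wake.wait(timeout=min(poll_interval, remaining))
        wake.clear()


# Demo: no notification ever arrives, but the task shows up in the
# "registry" and the next poll picks it up.
registry_has_task = threading.Event()
threading.Timer(0.05, registry_has_task.set).start()
result = wait_for_work(registry_has_task.is_set, threading.Event(),
                       poll_interval=0.02, max_wait=2.0)
print(result)  # prints: work_found
```

The notification only shortens the wait. Correctness comes from `has_work()`, which always consults the registry.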
Task rehydration after compaction
There is a secondary benefit to pull-based discovery that was not in the original design but turned out to be important: task rehydration after context compaction.
When an agent's context window fills up, the memory governor triggers compaction — the agent summarizes its context to free token space. This is necessary for long-running sessions, but it can cause the agent to lose track of its active tasks. The compacted summary might say "working on marketing tasks" but not retain the specific task IDs, priorities, or acceptance timestamps.
After compaction, the agent re-pulls its active tasks from the registry. Here's how get_my_next_task() works in the actual MCP server — this is the tool every agent calls to figure out what to do next:
```python
# From conductor/src/mcp_servers/agent_hub_mcp.py
@mcp.tool()
async def get_my_next_task(agent_id: str = None) -> dict:
    """Get your highest-priority unblocked task."""
    if not agent_id:
        agent_id = os.environ.get("ROLE_ID", "agent")

    task_store = get_task_store()
    next_task = task_store.get_next_unblocked_task(agent_id)

    if not next_task:
        all_tasks = task_store.list_tasks(assignee=agent_id, limit=10)
        blocked_count = sum(1 for t in all_tasks if t.blocked_by)
        return {
            "task": None,
            "message": f"No unblocked tasks for {agent_id}",
            "total_assigned": len(all_tasks),
            "blocked_tasks": blocked_count,
            "hint": "All your tasks are either completed or blocked"
                    if blocked_count > 0 else
                    "You have no tasks assigned. Ask your manager for work.",
        }

    return {
        "task": {"id": next_task.id, "description": next_task.description,
                 "priority": next_task.priority.value, "status": next_task.status.value},
        "action": f"Use update_task_status('{next_task.id}', 'in_progress') to claim it.",
    }
```
The registry has the canonical state. The agent rebuilds its working set from that state, not from its potentially lossy memory of what it was doing. This makes compaction safe. You can aggressively compress context without worrying about dropping task awareness.
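One way to picture the rehydration step is a small function that turns registry state back into a short note the agent can read after compaction. This is a sketch under assumptions: `build_rehydration_note`, the priority levels, and the set of "active" statuses are hypothetical and not taken from the production MCP server.

```python
PRIORITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}  # assumed levels
ACTIVE = {"accepted", "in_progress"}  # assumed "still being worked on" statuses


def build_rehydration_note(tasks: list[dict], agent_id: str) -> str:
    """Render the agent's active tasks as a short block to re-inject after compaction."""
    mine = [t for t in tasks if t["assignee"] == agent_id and t["status"] in ACTIVE]
    # Unknown priorities sort last instead of raising.
    mine.sort(key=lambda t: PRIORITY_ORDER.get(t["priority"], len(PRIORITY_ORDER)))
    if not mine:
        return f"No active tasks for {agent_id}."
    lines = [f"Active tasks for {agent_id} (from registry):"]
    for t in mine:
        lines.append(f"- {t['id']} [{t['priority']}/{t['status']}] {t['description']}")
    return "\n".join(lines)


registry_snapshot = [
    {"id": "task-7", "assignee": "marketing", "status": "in_progress",
     "priority": "medium", "description": "Write resilience post"},
    {"id": "task-9", "assignee": "marketing", "status": "accepted",
     "priority": "high", "description": "Review launch copy"},
    {"id": "task-3", "assignee": "marketing", "status": "completed",
     "priority": "high", "description": "Ship newsletter"},
]
note = build_rehydration_note(registry_snapshot, "marketing")
print(note)
# Active tasks for marketing (from registry):
# - task-9 [high/accepted] Review launch copy
# - task-7 [medium/in_progress] Write resilience post
```

The note carries exactly the details a compacted summary tends to drop: task IDs, priorities, and current status.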
Why not a database?
A reasonable question. The answer is operational simplicity. The task registry is a directory of JSON files on a PersistentVolume. It survives pod restarts by definition — that is what PVs are for. It does not require a database process, connection pooling, schema migrations, or backup procedures. It is readable with cat. It is debuggable with ls. When an agent is misbehaving at 2 AM — and I have had many of those nights over 9,799 commits — you want to be able to kubectl exec into the pod and read the task files directly. A Postgres instance does not give you that.
This trade-off has limits. If I scale past hundreds of concurrent tasks, the directory scan will need an index. I am not there yet. With 11 agents running, the task volume stays in the tens. I will add complexity when the simple approach stops working, and not before.
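If the scan ever does become the bottleneck, one possible shape for that index is an in-memory cache keyed by file modification time, so unchanged files are not re-parsed on every poll. The sketch below is speculative: `RegistryIndex` is a hypothetical class, and mtime-based invalidation can miss two writes that land within the filesystem's timestamp resolution.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


class RegistryIndex:
    """Cache parsed task files and re-read only those whose mtime changed."""

    def __init__(self, registry: Path) -> None:
        self.registry = registry
        self._cache: dict[str, tuple[int, dict]] = {}  # filename -> (mtime_ns, task)
        self.parses = 0  # counts file reads, to show the cache working

    def refresh(self) -> dict[str, list[str]]:
        """Return {assignee: [task ids]} for the current registry contents."""
        seen = set()
        for path in self.registry.glob("task-*.json"):
            seen.add(path.name)
            mtime = path.stat().st_mtime_ns
            cached = self._cache.get(path.name)
            if cached is None or cached[0] != mtime:
                self._cache[path.name] = (mtime, json.loads(path.read_text()))
                self.parses += 1
        for name in set(self._cache) - seen:
            del self._cache[name]  # drop entries for deleted files
        by_assignee: dict[str, list[str]] = defaultdict(list)
        for _, task in self._cache.values():
            by_assignee[task["assignee"]].append(task["id"])
        return {role: sorted(ids) for role, ids in by_assignee.items()}


with tempfile.TemporaryDirectory() as tmp_dir:
    registry = Path(tmp_dir)
    (registry / "task-1.json").write_text(json.dumps({"id": "1", "assignee": "marketing"}))
    (registry / "task-2.json").write_text(json.dumps({"id": "2", "assignee": "backend"}))
    index = RegistryIndex(registry)
    first = index.refresh()
    second = index.refresh()  # nothing changed, so nothing is re-parsed
    print(first, index.parses)
```

The files stay the source of truth and remain readable with cat; the index is only a cache that can be thrown away and rebuilt from a single scan.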
Part 2: Role-Based Tool Filtering
The problem: too many tools, too little context
Every MCP tool registered with an agent adds to its context window. The tool name, description, parameter schema, and usage examples all consume tokens. The core agent_hub_mcp.py alone has 190 registered functions across 8,500+ lines. Add in the 50+ tool files in mcp_tools/ and you are looking at a serious context budget problem.
At 90 registered tools, the tool definitions alone were eating a meaningful chunk of the agent's available context before it processed a single task.
This matters more than it sounds. Context window utilization directly affects reasoning quality. An agent with 70% of its context consumed by tool definitions has 30% left for the actual problem. An agent with 40% consumed by tools has 60% for the problem. That difference shows up in output quality, especially for complex tasks that require multi-step reasoning.
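A back-of-the-envelope way to see the budget is to estimate how many tokens the tool definitions take and what share of the window that is. The sketch below uses a crude four-characters-per-token heuristic, synthetic tool definitions of identical size, and an assumed 200,000-token window; all three are assumptions for illustration, and a real tokenizer and real schemas will give different numbers.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude size estimate; a real tokenizer will give different numbers."""
    return int(len(text) / chars_per_token)


def tool_overhead(tools: list[dict], context_window: int) -> tuple[int, float]:
    """Return (estimated tokens, fraction of the window) used by tool definitions."""
    tokens = sum(estimate_tokens(json.dumps(tool)) for tool in tools)
    return tokens, tokens / context_window


def make_tool(i: int) -> dict:
    """Build a synthetic tool definition; every one has the same length."""
    return {"name": f"tool_{i:02d}", "description": "x" * 400,
            "input_schema": {"type": "object", "properties": {}}}


all_tools = [make_tool(i) for i in range(90)]
ic_tools = all_tools[:70]  # pretend 20 tools were filtered out

full_tokens, full_share = tool_overhead(all_tools, context_window=200_000)
ic_tokens, ic_share = tool_overhead(ic_tools, context_window=200_000)
saved = 1 - ic_tokens / full_tokens
print(f"{full_tokens} -> {ic_tokens} estimated tokens ({saved:.0%} less)")
```

Running the same measurement against the real tool list, with a real tokenizer, is what turns "a meaningful chunk" into a number you can track over time.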
Not every agent needs every tool. The Marketing agent does not need verify_task or clone_agent. The Fullstack agent does not need meeting scheduling. But in my original configuration, every one of the 11 agent roles loaded the full tool set because it was simpler to maintain one list than role-specific lists.
Simple to maintain. Expensive to run.
The solution: tool_filter.py
tool_filter.py filters the MCP tool set based on the agent's role. It defines two key sets and uses environment variables to control behavior.
```python
ESSENTIAL_TOOLS = {
    # Every agent needs these regardless of role
    "get_inbox",
    "send_message",
    "send_to_agent",
    "get_my_next_task",
    "accept_task",
    "complete_task_unverified",
    "add_task_progress",
    "update_task_status",
    "get_task_status",
    "list_assigned_tasks",
    "report_blocker",
    "publish_event",
    "discover_agents",
}

MANAGER_ONLY_TOOLS = {
    # Only CEO, CTO, and similar leadership roles get these
    "verify_task",
    "design_agent",
    "clone_agent",
    "save_agent_profile",
    "list_agent_templates",
    "schedule_agent_meeting",
    "start_agent_meeting",
    "end_agent_meeting",
    "get_meeting_status",
    "send_meeting_message",
    "get_meeting_messages",
    "create_task_tree",
    "get_task_tree",
    "assign_task",
    "delegate_task",
    "list_running_agents",
    "restore_from_archive",
}
```
The filtering logic:
```python
def filter_tools_for_role(all_tools: list[dict], role_id: str) -> list[dict]:
    """Return the tool subset appropriate for this role."""
    is_manager = role_id in ("ceo", "cto")
    minimal_mode = os.environ.get("MINIMAL_TOOLS_MODE", "").lower() == "true"
    extended = os.environ.get("ENABLE_EXTENDED_TOOLS", "").lower() == "true"

    if is_manager and not minimal_mode:
        return all_tools  # Managers keep the full 90-tool set

    filtered = []
    for tool in all_tools:
        name = tool["name"]
        if name in MANAGER_ONLY_TOOLS:
            continue  # Strip manager tools from IC agents
        if minimal_mode and name not in ESSENTIAL_TOOLS:
            continue  # In minimal mode, only essentials
        filtered.append(tool)

    if extended:
        # Re-add specific tools for agents that need them situationally
        extended_tools = _get_extended_tools(role_id)
        filtered.extend(extended_tools)

    return filtered
```
Three modes of operation:
| Mode | Controlled by | Tool count | Use case |
|---|---|---|---|
| Full | Manager role (CEO/CTO) | ~90 | Leadership agents that delegate, verify, and orchestrate |
| Standard IC | Non-manager role | ~70 | Individual contributor agents (marketing, fullstack, backend) |
| Minimal | MINIMAL_TOOLS_MODE=true | ~15 | Sub-agents and short-lived specialist runs |
The ENABLE_EXTENDED_TOOLS variable lets you selectively re-add tools for specific roles without going back to the full set. If the marketing agent needs one extra tool for a specific workflow, you add it to the extended set for that role instead of giving every agent the full 90.
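The per-role extension could be as small as a lookup table plus a merge that avoids duplicates. The sketch below is hypothetical: the post does not show how `_get_extended_tools` is implemented, so `EXTENDED_TOOLS_BY_ROLE`, `extended_tools_for`, and `merge_tools` are illustrative names, and the marketing entry is an invented example.

```python
# Hypothetical per-role additions; the real mapping is not shown in this post.
EXTENDED_TOOLS_BY_ROLE = {
    "marketing": {"schedule_agent_meeting"},
}


def extended_tools_for(role_id: str, all_tools: list[dict]) -> list[dict]:
    """Pick the extra tool definitions configured for this role."""
    wanted = EXTENDED_TOOLS_BY_ROLE.get(role_id, set())
    return [tool for tool in all_tools if tool["name"] in wanted]


def merge_tools(base: list[dict], extra: list[dict]) -> list[dict]:
    """Append extras while keeping each tool name at most once."""
    names = {tool["name"] for tool in base}
    merged = list(base)
    for tool in extra:
        if tool["name"] not in names:
            merged.append(tool)
            names.add(tool["name"])
    return merged


all_tools = [{"name": n} for n in
             ("get_inbox", "send_message", "schedule_agent_meeting", "clone_agent")]
ic_tools = [t for t in all_tools if t["name"] in {"get_inbox", "send_message"}]

marketing_tools = merge_tools(ic_tools, extended_tools_for("marketing", all_tools))
print([t["name"] for t in marketing_tools])
# prints: ['get_inbox', 'send_message', 'schedule_agent_meeting']
```

Keeping the additions in one table per role makes the exceptions visible in a single place, which preserves most of the maintenance simplicity of a single shared list.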
The impact
Dropping from 90 to 70 tools for IC agents removed roughly 20% of the tool-definition overhead (20 of 90 tools is about 22%, assuming definitions of similar size). Those tokens go back to actual reasoning, task state, and working memory. It is not a dramatic number in isolation, but context is a budget, and every percentage point of headroom compounds across a full work session.
The more important effect is qualitative. Agents with fewer, more relevant tools make better tool-selection decisions. When the marketing agent sees 70 tools instead of 90, and the 20 it lost are tools it would never correctly use anyway, it spends less reasoning on "should I use clone_agent here?" The answer was always no. Now it does not have to figure that out.
I also observed fewer tool-use errors after filtering. Before the change, agents occasionally called manager-only tools they did not have permission to execute, which produced error responses that consumed context and confused the reasoning chain. Removing the tools from the agent's view eliminated that failure mode entirely.
Putting It Together
These two systems — pull-based discovery and role-based filtering — address different problems but share a design philosophy: reduce what the agent has to deal with, and make the remaining pieces crash-proof.
Pull-based discovery reduces the agent's dependency on infrastructure reliability. NATS can go down. Pods can restart. Nodes can be evicted. The task registry persists through all of it, and the agent reconstructs its workload on startup.
Role-based filtering reduces the agent's cognitive overhead. Fewer irrelevant tools means more context for the problem at hand, fewer selection errors, and faster reasoning.
Neither is architecturally novel. A directory of JSON files is not a breakthrough. An allowlist filter on a tool set is not a research contribution. But running 11 AI agents in production — with 9,799 commits and 83,163 test functions to show for it — is less about architectural novelty and more about eliminating the hundred small failure modes that turn a capable model into an expensive idle process.
The pattern I keep finding is that agent reliability comes from the same principles as service reliability: design for failure, pull instead of push, shed unnecessary load, and make state durable. The agents themselves are smart enough. The infrastructure around them just needs to stop getting in the way.
Build your own cyborgenic organization at agent.ceo.