Resilient Agent Task Delivery: Pull-Based Discovery and Role-Based Tool Filtering
TL;DR
- Pull-based task discovery eliminates lost work from pod restarts, NATS message loss, and node evictions — agents reconstruct their workload from a durable registry on startup.
- Role-based MCP tool filtering cuts tool-definition context overhead by roughly 20%, giving agents more reasoning headroom and fewer tool-selection errors.
- Both patterns follow the same principle at the heart of any cyborgenic organization: reduce what the agent has to deal with, and make the rest crash-proof.
Run AI agents long enough and you will watch a task vanish. Not fail — vanish. The agent that was supposed to receive it restarted between assignment and delivery. The NATS message hit an empty socket. The task registry still says "assigned." The agent comes back up with a clean slate and no idea it owes anyone anything.
A cyborgenic organization cannot tolerate that. This post covers two problems we solved at agent.ceo while running a fleet of Claude Code agents in Kubernetes: making task delivery survive infrastructure failures, and keeping context windows small enough that agents can actually think.
Part 1: Pull-Based Task Discovery
The failure mode
Our agents communicate over NATS. NATS is excellent at what it does — lightweight publish/subscribe messaging with sub-millisecond latency. But it has a property that becomes a liability in agent systems: messages to offline subscribers are lost.
That is a fine trade-off for microservices where a load balancer routes around unhealthy instances. It is not fine when your "subscriber" is a single AI agent with a unique role, and the "message" is a task assignment that took a founder 10 minutes to write.
Here is the timeline of the failure:
1. Founder assigns a task to the marketing agent via TMS.
2. TMS publishes an assignment notification over NATS.
3. The marketing agent's pod is restarting — node eviction, OOM-kill, wrapper restart, whatever.
4. NATS delivers the message to nobody.
5. The marketing agent comes back online with no knowledge of the task.
6. The task sits in "assigned" status until the SLA enforcement system catches it, pings the agent, and eventually reassigns it.
Before we tightened SLA enforcement, step 6 could take hours. After tightening it to 25 minutes, it was faster — but 25 minutes of dead time on every pod restart is still unacceptable. And it only addresses the symptom. The root cause is coupling task delivery to message delivery.
The fix: tasks live in the registry, not in queues
The solution is pull-based discovery. Instead of relying on NATS to push task notifications to agents, agents pull their assigned tasks from a shared task registry on startup and at regular intervals during operation.
The registry is a set of task-*.json files in a shared persistent volume. Each file contains the full task state: ID, assignee, status, description, priority, timestamps. The registry is the source of truth. NATS notifications are an optimization — they tell agents "something changed, go look" — but they are not required for correctness.
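For illustration, a task file might look like the following. The field names mirror the state listed above; the specific values and exact schema here are illustrative, not the production format:

{
  "id": "task-2047",
  "assignee": "marketing",
  "status": "assigned",
  "description": "Draft launch announcement for the Q3 release",
  "priority": 2,
  "created_at": "2025-06-12T09:14:03Z",
  "updated_at": "2025-06-12T09:48:20Z"
}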
We wired pull-based discovery into two places: session startup and the wrapper wait loop.
Session startup: _populate_ralph_backlog()
When an agent session starts, session_start.py calls _populate_ralph_backlog(). This function reads every task-*.json file from the registry, filters by the agent's ROLE_ID, and pulls in tasks with status assigned or accepted.
import json
import os
from pathlib import Path

def _populate_ralph_backlog():
    """Pull assigned/accepted tasks from the registry into the agent backlog."""
    registry_path = Path(os.environ["TASK_REGISTRY_PATH"])
    role_id = os.environ["ROLE_ID"]
    backlog = []
    for task_file in registry_path.glob("task-*.json"):
        task = json.loads(task_file.read_text())
        if task["assignee"] != role_id:
            continue
        if task["status"] not in ("assigned", "accepted"):
            continue
        backlog.append(task)
    # Lower number = higher priority; unprioritized tasks sort last.
    backlog.sort(key=lambda t: t.get("priority", 99))
    return backlog
The logic is deliberately simple. Glob the directory. Filter by role and status. Sort by priority. No caching, no indexing, no database. The registry is small enough — tens of tasks, not thousands — that a directory scan on startup is effectively free.
This means that regardless of what happened during the agent's downtime — NATS messages lost, pod killed mid-task, node drained — the agent reconstructs its full workload from the registry the moment it starts. No task vanishes because of infrastructure churn.
Wrapper wait loop: task-driven wake
The second integration point is claude_wrapper.sh. Between sessions, the wrapper enters a wait loop. In the old design, it slept for up to an hour before checking for new work. That was too slow.
The new loop checks the TMS every 30 seconds. If assigned tasks exist, the wrapper wakes immediately and starts a new session:
POLL_INTERVAL=30
MAX_WAIT=3600
elapsed=0

while [ "$elapsed" -lt "$MAX_WAIT" ]; do
    # Check the registry for tasks assigned to this role in an actionable status
    task_count=$(find "$TASK_REGISTRY_PATH" -name "task-*.json" \
        -exec grep -lE "\"assignee\": ?\"$ROLE_ID\"" {} \; | \
        xargs -r grep -lE '"status": ?"(assigned|accepted)"' 2>/dev/null | wc -l)
    if [ "$task_count" -gt 0 ]; then
        echo "[wrapper] Found $task_count assigned tasks — starting session"
        break
    fi
    sleep "$POLL_INTERVAL"
    elapsed=$((elapsed + POLL_INTERVAL))
done
Thirty seconds is the maximum latency between a task appearing in the registry and the agent noticing it. In practice, NATS notifications still arrive and trigger immediate wakes for agents that are already online. The polling is the fallback — it catches anything that NATS missed.
Task rehydration after compaction
There is a secondary benefit to pull-based discovery that was not in the original design but turned out to be important: task rehydration after context compaction.
When an agent's context window fills up, the memory governor triggers compaction — the agent summarizes its context to free token space. This is necessary for long-running sessions, but it can cause the agent to lose track of its active tasks. The compacted summary might say "working on marketing tasks" but not retain the specific task IDs, priorities, or acceptance timestamps.
After compaction, the agent re-pulls its active tasks from the registry. The registry has the canonical state. The agent rebuilds its working set from that state, not from its potentially lossy memory of what it was doing. This makes compaction safe. You can aggressively compress context without worrying about dropping task awareness.
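As a minimal sketch, the re-pull can reuse the same registry scan as session startup. The hook name, summary format, and call site below are assumptions for illustration, not the production code:

import json
import os
from pathlib import Path

def rehydrate_tasks_after_compaction(compacted_summary: str) -> str:
    """Append canonical task state from the registry to a compacted summary.

    Illustrative only: the hook and summary format are assumptions; the
    registry scan mirrors _populate_ralph_backlog().
    """
    registry_path = Path(os.environ["TASK_REGISTRY_PATH"])
    role_id = os.environ["ROLE_ID"]
    active = []
    for task_file in registry_path.glob("task-*.json"):
        task = json.loads(task_file.read_text())
        if task["assignee"] == role_id and task["status"] in ("assigned", "accepted"):
            active.append(task)
    active.sort(key=lambda t: t.get("priority", 99))
    lines = [
        f"- {t['id']} ({t['status']}, priority {t.get('priority', 99)})" for t in active
    ]
    return compacted_summary + "\n\nActive tasks (from registry):\n" + "\n".join(lines)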
Why not a database?
A reasonable question. The answer is operational simplicity. The task registry is a directory of JSON files on a PersistentVolume. It survives pod restarts by definition — that is what PVs are for. It does not require a database process, connection pooling, schema migrations, or backup procedures. It is readable with cat. It is debuggable with ls. When an agent is misbehaving at 2 AM, you want to be able to kubectl exec into the pod and read the task files directly. A Postgres instance does not give you that.
This trade-off has limits. If we scale past hundreds of concurrent tasks, the directory scan will need an index. We are not there yet. We will add complexity when the simple approach stops working, and not before.
Part 2: Role-Based Tool Filtering
The problem: too many tools, too little context
Every MCP tool registered with an agent adds to its context window. The tool name, description, parameter schema, and usage examples all consume tokens. At 90 registered tools, the tool definitions alone were eating a meaningful chunk of the agent's available context before it processed a single task.
This matters more than it sounds. Context window utilization directly affects reasoning quality. An agent with 70% of its context consumed by tool definitions has 30% left for the actual problem. An agent with 40% consumed by tools has 60% for the problem. That difference shows up in output quality, especially for complex tasks that require multi-step reasoning.
Not every agent needs every tool. The marketing agent does not need verify_task or clone_agent. The backend agent does not need meeting scheduling. But in our original configuration, every agent loaded the full tool set because it was simpler to maintain one list than role-specific lists.
Simple to maintain. Expensive to run.
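For a rough sense of the cost, you can estimate tool-definition overhead directly from the registered schemas. This is a back-of-the-envelope sketch; the four-characters-per-token ratio is a common approximation, not the model's actual tokenizer:

import json

def estimate_tool_context_tokens(tools: list[dict]) -> int:
    """Approximate tokens consumed by tool definitions (name, description, schema)."""
    chars = sum(len(json.dumps(tool)) for tool in tools)
    return chars // 4  # Rough heuristic: ~4 characters per token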
The solution: tool_filter.py
tool_filter.py filters the MCP tool set based on the agent's role. It defines two key sets and uses environment variables to control behavior.
ESSENTIAL_TOOLS = {
# Every agent needs these regardless of role
"get_inbox",
"send_message",
"send_to_agent",
"get_my_next_task",
"accept_task",
"complete_task_unverified",
"add_task_progress",
"update_task_status",
"get_task_status",
"list_assigned_tasks",
"report_blocker",
"publish_event",
"discover_agents",
}
MANAGER_ONLY_TOOLS = {
# Only CEO, CTO, and similar leadership roles get these
"verify_task",
"design_agent",
"clone_agent",
"save_agent_profile",
"list_agent_templates",
"schedule_agent_meeting",
"start_agent_meeting",
"end_agent_meeting",
"get_meeting_status",
"send_meeting_message",
"get_meeting_messages",
"create_task_tree",
"get_task_tree",
"assign_task",
"delegate_task",
"list_running_agents",
"restore_from_archive",
}
The filtering logic:
import os

def filter_tools_for_role(all_tools: list[dict], role_id: str) -> list[dict]:
    """Return the tool subset appropriate for this role."""
    is_manager = role_id in ("ceo", "cto")
    minimal_mode = os.environ.get("MINIMAL_TOOLS_MODE", "").lower() == "true"
    extended = os.environ.get("ENABLE_EXTENDED_TOOLS", "").lower() == "true"

    if is_manager and not minimal_mode:
        return all_tools  # Managers keep the full 90-tool set

    filtered = []
    for tool in all_tools:
        name = tool["name"]
        if name in MANAGER_ONLY_TOOLS:
            continue  # Strip manager tools from IC agents
        if minimal_mode and name not in ESSENTIAL_TOOLS:
            continue  # In minimal mode, only essentials
        filtered.append(tool)

    if extended:
        # Re-add specific tools for agents that need them situationally
        extended_tools = _get_extended_tools(role_id)
        filtered.extend(extended_tools)

    return filtered
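Wiring the filter in at startup is then a one-liner. In this sketch, load_registered_tools() is a placeholder for however the full MCP tool list is actually assembled:

import os

# Hypothetical wiring; load_registered_tools() is a stand-in for the real loader.
all_tools = load_registered_tools()
active_tools = filter_tools_for_role(all_tools, os.environ["ROLE_ID"])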
Three modes of operation:
| Mode | Controlled by | Tool count | Use case |
|---|---|---|---|
| Full | Manager role (CEO/CTO) | ~90 | Leadership agents that delegate, verify, and orchestrate |
| Standard IC | Non-manager role | ~70 | Individual contributor agents (marketing, fullstack, backend) |
| Minimal | MINIMAL_TOOLS_MODE=true | ~15 | Sub-agents and short-lived specialist runs |
The ENABLE_EXTENDED_TOOLS variable lets you selectively re-add tools for specific roles without going back to the full set. If the marketing agent needs one extra tool for a specific workflow, you add it to the extended set for that role instead of giving every agent the full 90.
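The post does not show _get_extended_tools, but the shape is a small role-to-tools mapping. The mapping below is a hypothetical example, not the production configuration:

# Hypothetical sketch; the role names and tool choices are examples only.
ALL_TOOLS: list[dict] = []  # populated at startup from the MCP tool registry (assumption)

EXTENDED_TOOLS_BY_ROLE = {
    "marketing": {"schedule_agent_meeting"},
    "fullstack": {"create_task_tree"},
}

def _get_extended_tools(role_id: str) -> list[dict]:
    """Return full tool definitions for this role's extended allowlist."""
    names = EXTENDED_TOOLS_BY_ROLE.get(role_id, set())
    return [tool for tool in ALL_TOOLS if tool["name"] in names]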
The impact
Dropping from 90 to 70 tools for IC agents freed approximately 20% of the tool-definition context overhead. That is 20% more context available for actual reasoning, task state, and working memory. It is not a dramatic number in isolation, but context is a budget, and every percentage point of headroom compounds across a full work session.
The more important effect is qualitative. Agents with fewer, more relevant tools make better tool-selection decisions. When the marketing agent sees 70 tools instead of 90, and the 20 it lost are tools it would never correctly use anyway, it spends less reasoning on "should I use clone_agent here?" The answer was always no. Now it does not have to figure that out.
We also observed fewer tool-use errors after filtering. Agents occasionally called manager-only tools they did not have permission to execute, which produced error responses that consumed context and confused the reasoning chain. Removing the tools from the agent's view eliminated that failure mode entirely.
Putting It Together
These two systems — pull-based discovery and role-based filtering — address different problems but share a design philosophy: reduce what the agent has to deal with, and make the remaining pieces crash-proof.
Pull-based discovery reduces the agent's dependency on infrastructure reliability. NATS can go down. Pods can restart. Nodes can be evicted. The task registry persists through all of it, and the agent reconstructs its workload on startup.
Role-based filtering reduces the agent's cognitive overhead. Fewer irrelevant tools means more context for the problem at hand, fewer selection errors, and faster reasoning.
Neither is architecturally novel. A directory of JSON files is not a breakthrough. An allowlist filter on a tool set is not a research contribution. But running AI agents in production is less about architectural novelty and more about eliminating the hundred small failure modes that turn a capable model into an expensive idle process.
The pattern we keep finding is that agent reliability comes from the same principles as service reliability: design for failure, pull instead of push, shed unnecessary load, and make state durable. The agents themselves are smart enough. The infrastructure around them just needs to stop getting in the way.
Build your own cyborgenic organization at agent.ceo.