Your AI agent finishes a task, compacts its context, and dies. You restart it manually. It does one more thing, compacts again, and dies again. You restart it again. This is the default experience for anyone running LLM-based agents in production, and it is miserable.
The core problem is straightforward: most agent frameworks treat each invocation as a one-shot. The agent receives a prompt, does work, and exits. If you want continuous operation, you are left duct-taping cron jobs to shell scripts, hoping the agent picks up where it left off, and checking Slack at 2 AM to see if the whole thing fell over.
We got tired of this. So we built autonomous loop infrastructure that lets our agents self-sustain indefinitely — waking themselves up after compaction, blocking premature exits, running periodic work cycles, and reporting their own health. This is the system that runs our production fleet of AI agents at agent.ceo today.
Here is how to build it yourself.
The Architecture: Five Components
A self-pacing autonomous loop needs five cooperating pieces:
- Wakeup Hook — re-injects a prompt after context compaction so the agent does not go silent
- Prompt Watchdog — monitors the agent session and injects standing mandates (directives the agent must always follow) when the agent would otherwise idle
- Stop Gate — intercepts exit signals and blocks premature shutdown when the agent still has work to do
- Scheduled Loop — runs periodic work cycles on a timer, independent of the agent's own pacing
- Health Reporter — exposes loop status so you can monitor the fleet without guessing
None of these is complicated on its own. The power comes from how they compose. Let us walk through each one.
Step 1: The Wakeup Hook
When an LLM agent compacts its context (summarizing old conversation to free up tokens), it often loses its sense of what to do next. The wakeup hook fires immediately after compaction and re-injects a minimal prompt.
# agent_wakeup.py — post-compaction wakeup hook
def on_compaction_complete(agent_session):
"""Fire after context compression to keep the agent alive."""
inbox = agent_session.get_inbox()
pending_tasks = agent_session.get_active_tasks()
if pending_tasks:
prompt = f"You just compacted. You have {len(pending_tasks)} active tasks. "
prompt += f"Next task: {pending_tasks[0].summary}. Continue."
else:
prompt = "You just compacted. Check your inbox for new work."
agent_session.inject_prompt(prompt)
The key insight: the wakeup prompt must be concrete. "Continue working" is too vague after compaction — the agent has lost most of its prior context. Tell it exactly how many tasks remain and which one is next.
Step 2: The Prompt Watchdog
The watchdog runs as a background process monitoring the agent's session. When it detects the agent is idle (no output for N seconds), it injects standing mandates — persistent directives that define the agent's autonomous behavior.
# prompt_watchdog.py — inject standing mandates on idle
class PromptWatchdog:
def __init__(self, agent_session, idle_threshold_sec=120):
self.session = agent_session
self.idle_threshold = idle_threshold_sec
self.dry_run = os.environ.get("AUTONOMOUS_LOOP_DRY_RUN") == "true"
def check_and_inject(self):
if self.session.seconds_since_last_output() < self.idle_threshold:
return
mandate = self.load_standing_mandate()
if self.dry_run:
log.info(f"[DRY-RUN] Would inject: {mandate[:80]}...")
return
self.session.inject_prompt(mandate)
log.info("Injected standing mandate after idle timeout")
Notice the AUTONOMOUS_LOOP_DRY_RUN flag. This is critical for testing. You do not want to discover your watchdog is injecting garbage prompts in production. Run it in dry-run mode first, inspect the logs, then flip it live.
Step 3: The Stop Gate
This is the component that surprised us the most in terms of impact. Without it, an agent that finishes one task will simply exit — even if it has three more tasks in its queue. The stop gate intercepts exit signals and blocks them when active work remains.
# autonomous_stop.py — block premature exits
class StopGate:
MAX_BLOCKS_PER_SESSION = 3
def __init__(self, agent_session):
self.session = agent_session
self.block_count = 0
def on_exit_requested(self):
active_tasks = self.session.get_active_tasks()
if not active_tasks:
return ExitDecision.ALLOW
if self.block_count >= self.MAX_BLOCKS_PER_SESSION:
log.warning("Max blocks reached. Allowing exit to prevent stuck session.")
return ExitDecision.ALLOW
self.block_count += 1
log.info(f"Blocked exit ({self.block_count}/{self.MAX_BLOCKS_PER_SESSION}). "
f"Active tasks: {len(active_tasks)}")
return ExitDecision.BLOCK
The MAX_BLOCKS_PER_SESSION cap is essential. Without it, a misbehaving agent with a task it can never complete will loop forever. We set ours to 3 — enough to catch premature exits, not enough to create an unkillable process. The counter resets at session start via the wrapper script.
Step 4: The Scheduled Loop
The scheduler runs periodic work cycles independent of the agent's own initiative. Think of it as a cron job, but one that is aware of the agent's state.
#!/bin/bash
# scheduled_loops.sh — periodic work cycles
LOOP_INTERVAL=${LOOP_INTERVAL:-300} # 5 minutes default
while true; do
if agent_is_idle; then
inject_work_cycle "$AGENT_SESSION_ID"
log "Triggered scheduled work cycle"
else
log "Agent busy, skipping cycle"
fi
sleep "$LOOP_INTERVAL"
done
The scheduler checks whether the agent is already busy before injecting. Stacking prompts on a working agent creates confusion and wasted tokens. If the agent is mid-task, skip the cycle and try again next interval.
Step 5: The Health Reporter
You cannot manage what you cannot measure. The health reporter exposes loop status as structured data that your monitoring system can scrape.
# automata_status.py — loop health diagnostics
def get_loop_status(agent_session):
return {
"agent_id": agent_session.agent_id,
"loop_active": True,
"stop_gate_blocks": agent_session.stop_gate.block_count,
"last_wakeup_utc": agent_session.last_wakeup_time.isoformat(),
"watchdog_injections_total": agent_session.watchdog.injection_count,
"active_tasks": len(agent_session.get_active_tasks()),
"session_uptime_sec": agent_session.uptime_seconds(),
"dry_run": os.environ.get("AUTONOMOUS_LOOP_DRY_RUN") == "true"
}
We expose this via an HTTP endpoint that our fleet monitoring dashboard polls. When stop_gate_blocks hits 3 on multiple agents simultaneously, something systemic is wrong. When watchdog_injections_total climbs without corresponding task completions, the agent is spinning without making progress.
Gotchas That Will Bite You
Infinite loops. The most dangerous failure mode. An agent that keeps retrying a failing task will burn tokens forever. We enforce an anti-loop rule: if the same action fails 5 or more times consecutively, the agent must stop, mark the task as blocked, and escalate. Build this into your agent's core instructions, not just the loop infrastructure.
Cost runaway. Every watchdog injection and wakeup prompt costs tokens. A 5-minute scheduler on 10 agents is 2,880 injections per day. Most will be no-ops (agent is already busy), but monitor your token spend closely in the first week. We burned through a surprising amount before tuning our idle thresholds.
Dry-run everything first. We added AUTONOMOUS_LOOP_DRY_RUN after a watchdog bug injected the same mandate 47 times in one session. The agent dutifully tried to execute the same task 47 times. Set dry-run to true, let it run for 24 hours, inspect the logs, then go live.
Session state persistence. The stop gate counter must reset on fresh sessions but persist within a session. If you store it in an environment variable, it will reset on every subprocess call. Use a file or in-memory state that the wrapper script initializes once.
Graceful degradation. If the watchdog process crashes, the agent should still function — it just will not self-pace. Design each component to fail independently. A dead health reporter should never kill the agent loop.
Putting It All Together
The full initialization sequence looks like this:
# Session startup
reset_stop_gate_counter()
start_watchdog(idle_threshold=120, dry_run=DRY_RUN)
start_scheduled_loop(interval=300)
register_health_endpoint()
send_heartbeat_to_manager()
inject_startup_directive()
In production, our agents run this infrastructure 24/7. They wake up, check their inbox, work through tasks, compact when context fills up, wake themselves back up, and keep going. The stop gate catches premature exits. The watchdog catches idle drift. The scheduler ensures periodic housekeeping. The health reporter lets us sleep at night.
We open-sourced the patterns and run them on our own fleet. If you want to see autonomous agents in action — agents that manage their own lifecycle, verify their own work, and never need a manual restart — check out what we are building at agent.ceo.
The era of babysitting AI agents is over. Build the loop. Let them run.