
How to Safely Test AI Agents in Production

GenBrain AI
Tags: production, testing, dry-run, safety, autonomous-loops, tutorial

Staging environments lie. An AI agent that behaves perfectly in staging -- where the inbox is empty, the task queue is synthetic, and no other agent is competing for shared resources -- will surprise you in production. The real dependency graph, the real message volume, the real race conditions: these exist only in the live environment.

So you need to test in production. But production is also where a misbehaving agent can send real emails, merge real code, or burn through real API budgets. The question is not whether to test in production, but how to do it without handing an untested agent the keys.

At agent.ceo, we run a fleet of autonomous AI agents in production every day. Over the past year, we built six safety mechanisms that let us ship agent changes with the same confidence we deploy application code.

1. Dry-Run Mode

A single environment variable -- AUTONOMOUS_LOOP_DRY_RUN -- switches the autonomous loop's active components into observe-only mode. The agent still runs its full loop cycle, but the components that would normally intervene log what they would do instead of doing it.

AUTONOMOUS_LOOP_DRY_RUN=true

The stop-hook gate logs block reasons but allows exit. The prompt watchdog logs which mandates it would inject without injecting them. Grep the agent's logs for [DRY-RUN] entries to see every suppressed action.
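As a rough illustration of the observe-only pattern, here is a minimal Python sketch of how a component might honor the flag. Only the AUTONOMOUS_LOOP_DRY_RUN variable and the [DRY-RUN] log prefix come from the text above; the helper names (`dry_run_enabled`, `guarded`), the logger name, and the set of accepted truthy values are assumptions made for this example, not the actual implementation.

```python
import logging
import os

logger = logging.getLogger("autonomous_loop")  # hypothetical logger name


def dry_run_enabled(env=None):
    """Return True when AUTONOMOUS_LOOP_DRY_RUN holds a truthy value.

    Which spellings count as truthy is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("AUTONOMOUS_LOOP_DRY_RUN", "")
    return value.strip().lower() in {"1", "true", "yes"}


def guarded(description, action, env=None):
    """Run `action`, or only log what it would have done in dry-run mode."""
    if dry_run_enabled(env):
        logger.info("[DRY-RUN] would %s", description)
        return None
    return action()
```

Wrapping each side-effecting call in something like `guarded("inject mandate", inject)` keeps the rest of the loop cycle identical in both modes, which is what makes the dry-run logs representative of real behavior.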

Use it for: validating new hook logic, debugging loop behavior, onboarding a new agent role. For details on the stop-hook gate, see Autonomous Loop Stop-Hook Gate.

2. PAUSE File

Creating a single file halts the autonomous loop entirely -- no task pickup, no scheduled work, no prompt injection.

touch /home/appuser/workspace/PAUSE   # pause
rm /home/appuser/workspace/PAUSE      # resume

The automata status reporter detects this file and reports the agent's state as "paused," so monitoring systems know the agent is intentionally offline. Do not forget to remove the file -- a paused agent looks healthy at the pod level but does zero work.
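The check itself can be very small. The following Python sketch shows one way a loop could consult the file at the top of each cycle; the path is the one shown above, while `is_paused`, `run_cycle`, and the returned state strings are hypothetical names chosen for illustration.

```python
from pathlib import Path

PAUSE_FILE = Path("/home/appuser/workspace/PAUSE")


def is_paused(pause_file=PAUSE_FILE):
    """True while the PAUSE file exists; its contents are ignored."""
    return Path(pause_file).exists()


def run_cycle(pick_up_task, pause_file=PAUSE_FILE):
    """Run one loop cycle, doing no work at all while paused."""
    if is_paused(pause_file):
        return "paused"
    pick_up_task()
    return "ran"
```

Because the file is re-checked on every cycle, removing it resumes work on the next cycle without a restart, at least in this sketch.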

Use it for: hot-fixing an agent mid-session, debugging a live interaction, temporarily taking an agent offline for maintenance.

3. Degraded Mode

When a non-critical component fails during org provisioning, the system logs the degradation, marks the component as unhealthy, and continues deploying everything else. Critical failures (database, message bus, core runtime) still abort. Non-critical failures (metrics, optional integrations) degrade gracefully.

After provisioning, check the health output. A single degraded component is fine. Three degraded components are a pattern -- investigate the shared dependency. See Platform Update: Early August 2026 for more on this pattern.
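The critical versus non-critical split can be sketched in a few lines of Python. Everything named here (`Component`, `ProvisioningAborted`, `provision_org`) is hypothetical and only illustrates the control flow described above: critical failures abort, non-critical failures are recorded and provisioning continues.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("provisioning")  # hypothetical logger name


@dataclass
class Component:
    name: str
    provision: Callable[[], None]
    critical: bool


class ProvisioningAborted(Exception):
    """Raised when a critical component fails to provision."""


def provision_org(components):
    """Provision each component in order and return the names of degraded ones."""
    degraded = []
    for component in components:
        try:
            component.provision()
        except Exception as exc:
            if component.critical:
                # Database, message bus, core runtime: abort the whole deploy.
                raise ProvisioningAborted(f"{component.name} failed: {exc}") from exc
            # Metrics, optional integrations: record and keep going.
            logger.warning("degraded: %s (%s)", component.name, exc)
            degraded.append(component.name)
    return degraded
```

The returned list is what a post-provisioning health check would inspect, for example to flag the case where several components degrade at once.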

Use it for: customer org deployments where a partial deploy that serves traffic is better than no deploy at all.

4. Stop-Block Limit

A hard ceiling on how many times the stop-hook gate can prevent a session from exiting. The default is MAX_STOP_BLOCKS = 3. After three blocks, the gate steps aside and lets the session terminate, regardless of pending work. The counter resets on session restart.

This is your safety net for the scenarios you did not anticipate: a stuck task that will never complete, a deadlocked dependency chain, a broken external service the agent keeps retrying. Without this limit, a fully autonomous agent could run forever, burning tokens and never producing output.

If an agent is hitting the block limit regularly, the limit is not the problem -- the underlying task is. Check what the agent was trying to finish.
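To make the counting behavior concrete, here is a minimal Python sketch of a gate with a block ceiling. MAX_STOP_BLOCKS = 3 comes from the text; the `StopHookGate` class and its method are hypothetical, and the sketch assumes one gate instance per session so that a restart naturally resets the counter.

```python
MAX_STOP_BLOCKS = 3


class StopHookGate:
    """Blocks session exit while work is pending, up to a fixed number of times."""

    def __init__(self, max_blocks=MAX_STOP_BLOCKS):
        self.max_blocks = max_blocks
        self.blocks = 0  # a fresh instance per session means the count resets on restart

    def should_block_exit(self, has_pending_work):
        """Return True to block the exit, False to let the session terminate."""
        if not has_pending_work:
            return False
        if self.blocks >= self.max_blocks:
            # Ceiling reached: step aside regardless of pending work.
            return False
        self.blocks += 1
        return True
```

With pending work on every attempt, this gate blocks the first three exits and allows the fourth.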

5. Human Gate Timeout

When an agent requests human approval and no human responds within two minutes, the agent skips the pending action and moves to the next task. The skipped action is logged in the audit trail.

We reduced this from the original fifteen minutes after observing that longer timeouts stalled the entire fleet during off-hours. Most human responses arrive within 30 seconds -- two minutes is generous.

A high skip rate during business hours means your approval routing is broken. A high skip rate during off-hours is expected. The timeout protects availability; the audit trail protects accountability.
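The skip-on-timeout behavior might look like the following Python sketch. The two-minute default comes from the text; the function name, the use of a `queue.Queue` as the approval channel, and the audit record fields are all assumptions made for this example.

```python
import queue

APPROVAL_TIMEOUT_SECONDS = 120  # two minutes, per the text above


def await_approval(action, responses, audit_log, timeout=APPROVAL_TIMEOUT_SECONDS):
    """Wait for a human decision; skip the action and audit it if none arrives.

    `responses` is a queue.Queue that a human-facing channel puts True/False into.
    Returns True only when the action was explicitly approved.
    """
    try:
        approved = responses.get(timeout=timeout)
    except queue.Empty:
        # Nobody answered in time: record the skip and move on to the next task.
        audit_log.append({"action": action, "outcome": "skipped_timeout"})
        return False
    outcome = "approved" if approved else "denied"
    audit_log.append({"action": action, "outcome": outcome})
    return bool(approved)
```

Counting `skipped_timeout` entries per hour in the audit log is one simple way to derive the skip rate discussed above.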

6. Automata Status Reporter

A single-command diagnostic tool that answers "is the autonomous loop actually working right now?"

python3 automata_status.py          # human-readable
python3 automata_status.py --json   # machine-parseable

Reports daemon health (running/stopped), loop configuration, pending task count, current stop-block count, recent log entries, and PAUSE file presence. Run it periodically or pipe JSON into your monitoring stack. Alert on: daemon not running, block count at limit, pause file present for more than one hour.
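The three alert conditions can be evaluated with a small Python function over the JSON output. The JSON field names used here (`daemon_running`, `stop_block_count`, `pause_file_present`, `pause_age_seconds`) are assumptions, since the actual schema of `automata_status.py --json` is not shown in this post; adjust them to match the real output.

```python
import json

MAX_STOP_BLOCKS = 3
MAX_PAUSE_SECONDS = 3600  # one hour


def alerts_from_status(raw_json):
    """Return a list of alert messages for one status report.

    All field names below are hypothetical, not the reporter's real schema.
    """
    status = json.loads(raw_json)
    alerts = []
    if not status.get("daemon_running", False):
        alerts.append("daemon not running")
    if status.get("stop_block_count", 0) >= MAX_STOP_BLOCKS:
        alerts.append("stop-block count at limit")
    paused = status.get("pause_file_present", False)
    if paused and status.get("pause_age_seconds", 0) > MAX_PAUSE_SECONDS:
        alerts.append("paused for more than one hour")
    return alerts
```

A cron job or sidecar could feed the reporter's JSON output into a function like this and forward any non-empty result to the alerting system.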

Use it as your first diagnostic step when something seems off. See Monitoring an AI Agent Fleet for fleet-wide patterns.

Ship Agent Changes Like You Ship Code

These six mechanisms are not theoretical. They run in our production fleet today, and they are the reason we can push agent changes multiple times a day without holding our breath.

The principle behind all of them is the same one that makes modern software deployment safe: observe before you act, limit blast radius, make rollback trivial, and instrument everything. AI agents are not magic -- they are software. The same engineering discipline that gave us canary deploys and circuit breakers works here too.

Start with dry-run mode on your next agent change. Watch the logs. When you are confident, remove the flag. And keep the PAUSE file in your back pocket for the day something surprises you.

Ready to run your own autonomous agent fleet? See how it works at agent.ceo.
