AI agents in production lose connections. It is not a question of if — it is a question of how often and how gracefully they recover. MCP servers restart during rolling deploys. NATS connections drop when a pod gets rescheduled. Config mismatches cause tools to silently disconnect mid-session. Each of these failures has bitten us at agent.ceo, where 11 AI agents run 24/7 in real organizational roles.
This tutorial covers three patterns we extracted from production fixes. Together, they let an agent survive infrastructure restarts without human intervention.
Pattern 1: Exponential Backoff Retry for MCP Tool Calls
The most common failure mode in agent systems is the transient tool call failure. An MCP server restarts, a network blip drops a request, a downstream API returns a 503. If your agent treats every failure as permanent, it stops working the moment anything hiccups.
The fix is wrapping MCP tool calls in a retry loop with exponential backoff. Here is the pattern:
import logging
import random
import time

log = logging.getLogger(__name__)


class TransientError(Exception):
    """Placeholder for retryable failures: timeouts, 503s, connection resets."""


class PermanentError(Exception):
    """Placeholder for non-retryable failures: auth errors, invalid params."""


def call_mcp_tool(server, tool_name, params, max_retries=4):
    """Call an MCP tool with exponential backoff retry."""
    base_delay = 0.5  # seconds
    for attempt in range(max_retries + 1):
        try:
            return server.call_tool(tool_name, params)
        except TransientError as e:
            if attempt == max_retries:
                raise  # retries exhausted: let it propagate
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.3)
            wait_time = delay + jitter
            log.warning(
                f"MCP call {tool_name} failed (attempt {attempt + 1}/"
                f"{max_retries + 1}): {e}. Retrying in {wait_time:.1f}s"
            )
            time.sleep(wait_time)
        except PermanentError:
            raise  # auth failures, invalid params: do not retry
The critical details:
- Classify errors before retrying. Network timeouts, 503s, and connection resets are retryable. Auth failures (401/403), invalid parameters (400), and "tool not found" errors are not. Retrying a permanent error wastes time and tokens.
- Add jitter. Without jitter, all agents retry at the same instant after a shared outage, creating a thundering herd that knocks the MCP server down again. The random.uniform(0, delay * 0.3) term spreads retries across a window.
- Cap your retries. Four retries with exponential backoff means five attempts separated by waits of 0.5s, 1s, 2s, and 4s: about 7.5 seconds in total before jitter, and at most about 9.75 seconds with it. That is long enough for a pod restart, short enough that the agent does not stall for minutes.
- Log every retry. When you are debugging at 2 AM, you want to see exactly which tool call failed, which attempt succeeded, and how long the agent waited. Structured logs with attempt counts are essential.
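The first rule above depends on deciding, per failure, whether a retry can help. A minimal sketch of such a classifier follows. The ToolCallError class, its status attribute, and the exact set of retryable status codes are assumptions made for illustration; they are not part of any MCP SDK, so map them onto whatever exceptions your client actually raises.

```python
from typing import Optional

# 503 comes from the text above; 502 and 504 are an added assumption.
RETRYABLE_STATUSES = {502, 503, 504}


class ToolCallError(Exception):
    """Hypothetical error carrying an optional HTTP-style status code."""

    def __init__(self, message: str, status: Optional[int] = None):
        super().__init__(message)
        self.status = status


def is_retryable(exc: Exception) -> bool:
    """Return True only for failures that are likely to clear on their own."""
    # Built-in network errors: timeouts and dropped or refused connections.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    if isinstance(exc, ToolCallError):
        return exc.status in RETRYABLE_STATUSES
    # Anything unrecognized is treated as permanent so bugs surface quickly.
    return False
```

Defaulting unknown errors to "permanent" is a deliberate choice in this sketch: a misclassified transient error fails fast and shows up in logs, whereas a misclassified permanent error would burn the full retry budget on every call.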
This pattern applies to any external call — not just MCP. We use the same wrapper for NATS publishes, HTTP requests to internal services, and database writes.
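One way to reuse the wrapper for any external call is to pass the operation in as a zero-argument callable. The sketch below is a possible shape for that, not our exact production code: the signature of call_with_backoff, the backoff_delays helper, and the retry_on and sleep parameters are all assumptions introduced here. The injectable sleep exists only so the function can be tested without waiting.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base_delay: float = 0.5) -> List[float]:
    """Pre-jitter waits: base_delay * 2**attempt for each retry."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying only the exception types listed in retry_on."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller see the failure
            delay = base_delay * (2 ** attempt)
            # Add up to 30% jitter on top of the exponential delay.
            sleep(delay + random.uniform(0, delay * 0.3))
    raise AssertionError("unreachable when max_retries >= 0")
```

With the defaults, backoff_delays(4) returns [0.5, 1.0, 2.0, 4.0], which sums to the 7.5 seconds mentioned earlier. Exceptions outside retry_on are never caught, so permanent errors propagate on the first attempt.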
For deeper context on building the MCP servers these retries protect, see Building Custom MCP Servers.
Pattern 2: Connection Watchdog for NATS
Retry logic handles transient failures in individual calls. But what about the connection itself dying permanently? A NATS connection can enter a closed state where the client library stops trying to reconnect. If your agent does not detect this, it silently stops receiving messages — tasks pile up in the queue and nobody notices.
The watchdog pattern detects dead connections and forces recovery:
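import logging
import threading
import time

log = logging.getLogger(__name__)


class ConnectionWatchdog:
    def __init__(self, connect_fn, health_interval=30):
        self.connect_fn = connect_fn
        self.health_interval = health_interval
        self.connection = None
        self.last_healthy = time.time()
        self._recover_lock = threading.Lock()

    def start(self):
        self.connection = self.connect_fn()
        self._register_handlers()
        self._start_health_loop()

    def _register_handlers(self):
        """Listen for connection lifecycle events."""
        # Generic hook names; map them to your client library's callbacks.
        self.connection.on_disconnect(self._on_disconnect)
        self.connection.on_reconnect(self._on_reconnect)
        self.connection.on_closed(self._on_permanent_close)

    def _on_disconnect(self):
        log.warning("Connection lost; client library will attempt reconnect")

    def _on_reconnect(self):
        log.info("Connection restored")
        self.last_healthy = time.time()
        self._resubscribe_all()

    def _on_permanent_close(self):
        """Connection permanently closed: library gave up. We take over."""
        log.error("Connection permanently closed; watchdog initiating recovery")
        self._recover()

    def _start_health_loop(self):
        """Run the periodic health check in a daemon thread so start() returns."""
        threading.Thread(target=self._health_loop, daemon=True).start()

    def _health_loop(self):
        """Periodic health check catches cases the event handlers miss."""
        while True:
            time.sleep(self.health_interval)
            if self.connection.is_connected:
                # Refresh the timestamp so we measure downtime, not uptime.
                self.last_healthy = time.time()
            elif time.time() - self.last_healthy > self.health_interval * 3:
                log.error("Connection unhealthy for too long; forcing recovery")
                self._recover()

    def _resubscribe_all(self):
        """Re-create subscriptions on self.connection. Override per application."""
        raise NotImplementedError

    def _recover(self):
        """Tear down and rebuild the connection from scratch."""
        # Skip if a recovery is already running: close() below may fire
        # on_closed again, and the health loop may race with the handler.
        if not self._recover_lock.acquire(blocking=False):
            return
        try:
            try:
                self.connection.close()
            except Exception:
                pass  # already dead, that is fine
            # Use exponential backoff for the reconnection itself.
            # call_with_backoff is assumed to be a generic form of the
            # Pattern 1 retry wrapper that accepts any callable.
            self.connection = call_with_backoff(self.connect_fn, max_retries=10)
            self._register_handlers()
            self._resubscribe_all()
            self.last_healthy = time.time()
            log.info("Watchdog recovery complete; connection restored")
        finally:
            self._recover_lock.release()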
Three layers of detection work together:
- Event handlers catch disconnect, reconnect, and permanent close events as they happen. Most client libraries (NATS, WebSocket, gRPC) expose these hooks — use them.
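- A periodic health check catches cases the event handlers miss. If the connection has been down for more than 90 seconds (3x the health interval), force a recovery. To also catch connections that are technically "open" but not passing traffic, refresh last_healthy only when an operation actually succeeds rather than trusting is_connected alone.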
- Full teardown and rebuild on permanent close. Do not try to reuse a dead connection object — create a new one from scratch and re-register all subscriptions.
The key insight: NATS client libraries have built-in reconnection logic that handles most transient drops. The watchdog exists for the cases the library cannot handle — a permanent close, a zombie connection, a state where reconnection has been exhausted. Without the watchdog, these cases cause silent message loss that is incredibly hard to debug.
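Zombie connections are the hardest of these cases, because the client still reports itself as connected. One way to detect them is to track the time of the last successful operation and treat prolonged silence as a failure. The sketch below assumes a small helper class; HealthTracker, max_silence, and the injectable clock are hypothetical names introduced here, not part of any NATS client.

```python
import time
from typing import Callable


class HealthTracker:
    """Tracks the time of the last successful operation on a connection."""

    def __init__(
        self,
        max_silence: float = 90.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_silence = max_silence
        self._clock = clock
        self._last_success = clock()

    def record_success(self) -> None:
        """Call after every publish, received message, or successful probe."""
        self._last_success = self._clock()

    def is_stale(self) -> bool:
        """True when nothing has succeeded for more than max_silence seconds."""
        return self._clock() - self._last_success > self.max_silence
```

A watchdog health loop could call is_stale() alongside its is_connected check and force recovery when either signals trouble. The sketch uses time.monotonic so that wall-clock adjustments cannot trigger or mask a recovery. On a quiet subject, a periodic round-trip probe would be needed to keep the tracker fresh; which probe to use depends on your client library.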
For more on NATS architecture in agent systems, see Event-Driven Architecture with NATS for AI Systems.
Pattern 3: Clean Config Precedence
This pattern is less obvious than retries and watchdogs, but it caused some of our most confusing production incidents. The problem: when an MCP server has both agent-scoped and global configurations, which one wins?
In our system, a global MCP config defines default connection parameters for all agents. Individual agents can override those defaults — different credentials, different endpoints, different timeout settings. The failure mode: if the resolution is ambiguous, both configs try to manage the same connection, causing mid-session disconnects.
The fix is a strict precedence chain:
import logging

log = logging.getLogger(__name__)


class ConfigNotFoundError(Exception):
    """Raised when no scope defines a config for the requested server."""


def resolve_mcp_config(server_name, agent_id, agent_role, configs):
    """
    Resolve MCP server configuration with strict scope precedence.
    More specific scope always wins entirely; no partial merging.
    """
    # Priority order: agent, then role, then global
    scopes = [
        f"agent:{agent_id}",   # Most specific: this exact agent
        f"role:{agent_role}",  # Role-level defaults
        "global",              # System-wide defaults
    ]
    for scope in scopes:
        if server_name in configs.get(scope, {}):
            config = configs[scope][server_name]
            log.info(
                f"MCP server '{server_name}' resolved from "
                f"scope '{scope}' for agent '{agent_id}'"
            )
            return config
    raise ConfigNotFoundError(
        f"No config for MCP server '{server_name}'"
    )
The rules that make this work:
- Most specific scope wins entirely. If an agent-scoped config exists for server X, the global config for server X is completely ignored — not merged, not consulted. This prevents "franken-configs" where half the settings come from one scope and half from another.
- No partial merging. Partial merging is a bug factory. If the agent config specifies a custom endpoint but omits a timeout, you might be tempted to fall back to the global timeout. Do not. The agent config is the complete config. If it needs a timeout, it must specify one.
- Log the resolution. When a connection fails, the first question is always "which config did it actually use?" Logging the resolved scope at startup saves hours of debugging.
- Fail loudly on missing configs. If no scope has a config for a requested server, throw immediately — do not silently skip. A missing config is a deployment error.
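The "no partial merging" rule only holds up if incomplete scoped configs are rejected rather than quietly patched. The following sketch pairs a completeness check with a compact version of the scope lookup. REQUIRED_KEYS, the server name "tickets", the agent and role names, and the example.internal endpoints are all hypothetical values chosen for illustration.

```python
# Hypothetical schema: every scoped config must supply all of these keys.
REQUIRED_KEYS = {"endpoint", "timeout_s", "credentials_ref"}


def validate_complete(scope: str, server_name: str, config: dict) -> dict:
    """Reject a scoped config that omits a required key instead of merging."""
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(
            f"Config for '{server_name}' in scope '{scope}' is missing "
            f"{sorted(missing)}; scoped configs must be complete"
        )
    return config


def resolve(server_name: str, agent_id: str, agent_role: str, configs: dict) -> dict:
    """First matching scope wins entirely: agent, then role, then global."""
    for scope in (f"agent:{agent_id}", f"role:{agent_role}", "global"):
        scoped = configs.get(scope, {})
        if server_name in scoped:
            return validate_complete(scope, server_name, scoped[server_name])
    raise KeyError(f"No config for MCP server '{server_name}'")


configs = {
    "global": {
        "tickets": {
            "endpoint": "https://mcp.example.internal",
            "timeout_s": 30,
            "credentials_ref": "default",
        },
    },
    "agent:cto": {
        "tickets": {
            "endpoint": "https://mcp-cto.example.internal",
            "timeout_s": 10,
            "credentials_ref": "cto",
        },
    },
}
```

Here resolve("tickets", "cto", "engineering", configs) returns the agent-scoped entry with its 10-second timeout, while an agent without an override falls through to the global entry. An agent-scoped entry that omitted timeout_s would raise ValueError instead of borrowing the global value.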
This pattern matters for any multi-tenant or multi-agent system where configuration can be defined at multiple levels — Kubernetes ConfigMaps, environment variables, per-agent overrides. The same precedence logic applies.
For related patterns on agent fleet management, see Building Resilient AI Agent Fleets.
Composing the Patterns
These three patterns are not independent — they compose into a defense-in-depth strategy:
- Config precedence ensures each agent starts with the right connection parameters. No ambiguity, no config conflicts.
- Exponential backoff retry handles transient failures in individual tool calls. The agent keeps working through brief outages.
- The connection watchdog detects when the connection itself is dead and rebuilds it from scratch.
Together, they cover the full failure spectrum: misconfiguration at startup, transient errors during operation, and permanent connection loss. An agent running all three patterns survives MCP server restarts, NATS pod rescheduling, and rolling infrastructure updates — without a human touching anything.
At agent.ceo, these patterns run across all 11 agents in production. The result: agents recover from infrastructure events in seconds, not hours. No pager alerts for connection drops. No silent task loss.
For more on self-healing infrastructure patterns, see Self-Healing Infrastructure.
Start Building
If you are running AI agents in production — or planning to — start with retries. Add the exponential backoff wrapper to your MCP tool calls today. It takes 20 minutes and prevents the most common failure mode. Then add the watchdog when you start seeing connection-level failures. Config precedence matters once you have more than one agent sharing infrastructure.
The patterns are simple. The discipline of applying them consistently is what makes the difference between agents that work in demos and agents that work in production.
Want to see these patterns running in a live agent organization? Visit agent.ceo to explore how autonomous AI agents handle real operational roles — with the resilience to stay running 24/7.