Technical · 7 min read

How to Debug Mid-Session MCP Disconnections in AI Agent Systems

GenBrain AI
Tags: mcp, debugging, agents, production, tool-integration, resilience

Your agent was working fine. Then its tools vanished.

No error. No crash log. The agent just stopped being able to call its MCP tools mid-session, as if someone quietly unplugged the cable while it was working. It kept running, kept generating text, but every tool call returned nothing or threw an opaque transport error. The session was effectively brain-dead.

We hit this failure mode twice in one week across our production fleet at GenBrain. The two bugs had different root causes but shared the same maddening symptom: tools that worked at session start silently disappeared minutes or hours later. Here is the exact diagnostic checklist we built to find them, and the prevention patterns we now enforce so they do not come back.

The Symptom Pattern

Before diving into diagnostics, confirm you are actually dealing with an MCP disconnection and not something else:

  • The agent started normally and tools worked for the first few minutes.
  • Tool calls begin returning errors, empty responses, or timing out.
  • The agent process itself is still alive and generating output.
  • Restarting the agent temporarily fixes the problem.
  • The failure is intermittent and does not occur in every session.

If all five match, you have a mid-session MCP disconnection. Walk through the checklist below in order.

The Five-Step Diagnostic Checklist

Step 1: Check If the MCP Server Is Registered

The most basic failure: the MCP server your agent depends on is not registered at all, or was registered and then removed by a conflicting startup script.

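Inspect your active MCP configuration. In Claude Code, first ask the runtime what it has loaded, then check the config file for each scope that feeds the merged view:

# Ask the runtime which servers it has loaded
claude mcp list

# Then inspect each scope's file. Paths vary by Claude Code version:
# recent releases keep project-scope servers in .mcp.json and
# user- and local-scope entries in ~/.claude.json.

# Check user-scope MCP registrations
jq '.mcpServers' ~/.claude/settings.json

# Check project-scope registrations
jq '.mcpServers' .claude.json

# Check local-scope registrations
jq '.mcpServers' .claude/settings.local.json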

You are looking for your MCP server entry with the correct command, args, and environment variables. If it is missing entirely, your startup scripts may be cleaning it up. If it is present but the command path is wrong, the server never started.
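
The same check can be scripted so it runs before every agent start. The sketch below is illustrative only: find_registration and command_resolves are hypothetical helper names, and it assumes each config file is JSON with a top-level "mcpServers" object, as in the jq commands above.

```python
import json
import shutil
from pathlib import Path


def find_registration(config_paths, server_name):
    """Return {path: entry} for every config file that registers server_name."""
    found = {}
    for path in map(Path, config_paths):
        if not path.is_file():
            continue  # a missing config file simply has no registrations
        try:
            config = json.loads(path.read_text())
        except json.JSONDecodeError:
            continue  # an unparseable file is treated as having no entries
        if not isinstance(config, dict):
            continue
        # Assumption: servers live under a top-level "mcpServers" object.
        entry = (config.get("mcpServers") or {}).get(server_name)
        if entry is not None:
            found[str(path)] = entry
    return found


def command_resolves(entry):
    """True if the entry's command is an executable path or is found on PATH."""
    command = entry.get("command", "")
    return bool(command) and shutil.which(command) is not None
```

An empty result from find_registration corresponds to the "missing entirely" case, while a hit whose command fails command_resolves corresponds to the "server never started" case.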

Step 2: Check for Scope Conflicts

This is the bug that cost us the most debugging time. MCP servers can be registered at multiple scopes: user, local, and project. When the same server is registered at two scopes with different invocation methods, the runtime may pick the wrong one.

In our case, the agent-hub MCP was registered at user scope via a crash-resilient wrapper script and simultaneously at local scope via direct Python invocation in .claude.json. The wrapper script included automatic restart on failure. The direct invocation did not. When Claude Code resolved the conflict by using the local-scope entry, the MCP server ran without crash recovery. A single memory spike or transient failure killed it permanently for the rest of the session.

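Diagnostic: Search all three config locations for the same server name. If it appears in more than one file, compare the command in each entry: duplicates with different invocations are the bug described above.

# Find all MCP registrations across scopes (-H prints the file name, -n the line)
grep -Hn "your-mcp-server" ~/.claude/settings.json .claude.json .claude/settings.local.json 2>/dev/null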

Fix: Pick one scope. We chose local scope with the crash-resilient wrapper, and our entrypoint script now explicitly removes stale user-scope registrations from prior versions before the agent starts.
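
A duplicate check like this is easy to automate once the configs are parsed. The following is a minimal sketch, not our actual entrypoint code: find_scope_conflicts is a hypothetical name, and the commands and paths in the sample data are made up for illustration.

```python
from collections import defaultdict


def find_scope_conflicts(scoped_configs):
    """Map each server name registered in more than one scope to its entries.

    scoped_configs: {scope_name: parsed_config_dict}
    """
    seen = defaultdict(dict)
    for scope, config in scoped_configs.items():
        for name, entry in (config.get("mcpServers") or {}).items():
            seen[name][scope] = entry
    return {name: scopes for name, scopes in seen.items() if len(scopes) > 1}


# Hypothetical sample data mirroring the dual-registration bug.
configs = {
    "user": {"mcpServers": {"agent-hub": {"command": "/opt/mcp/wrapper.sh"}}},
    "local": {"mcpServers": {"agent-hub": {"command": "python", "args": ["server.py"]}}},
    "project": {"mcpServers": {"docs": {"command": "docs-mcp"}}},
}
for name, scopes in find_scope_conflicts(configs).items():
    print(f"{name} is registered at: {', '.join(sorted(scopes))}")
```

A startup script could refuse to launch the agent whenever this function returns a non-empty result, which turns a silent misconfiguration into an immediate, visible failure.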

Step 3: Check the Tool Whitelist

Your MCP server is running. The transport is healthy. But the agent still cannot see certain tools. The problem might not be the connection at all. It might be your tool filtering layer.

We run a function called apply_minimal_toolset() that strips non-essential tools from the agent's available set at startup. This reduces noise and keeps agents focused. The problem: when we added knowledge base and wiki tools to the MCP server, we forgot to add them to the ESSENTIAL_TOOLS whitelist. All 17 tools were silently stripped before any agent ever saw them. The knowledge graph was completely inaccessible across the entire fleet, and no error was ever raised.

Diagnostic: If specific tools are missing but the MCP server is healthy, check whether your system has a tool filter, whitelist, or permission layer that runs between MCP registration and tool availability.

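# Our ESSENTIAL_TOOLS set went from 96 to 113 after the fix.
# Check whether a missing tool is in the whitelist.
# (Import ESSENTIAL_TOOLS from wherever your tool filter defines it.)
tool_name = "your-missing-tool"  # placeholder: the tool the agent cannot see
print(tool_name in ESSENTIAL_TOOLS)  # False means it is silently stripped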

Fix: Audit your whitelist every time you add new MCP tools. We now have a CI check that compares registered MCP tools against the whitelist and fails the build if any tool is registered but not whitelisted.
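
The core of such a CI check is a set difference. Here is a sketch of the idea under stated assumptions: unlisted_tools and check_whitelist are hypothetical names, and how you obtain the list of registered tools (for example, by asking the server to list them in a test fixture) depends on your setup.

```python
def unlisted_tools(registered_tools, essential_tools):
    """Return registered tool names that the whitelist would strip, sorted."""
    return sorted(set(registered_tools) - set(essential_tools))


def check_whitelist(registered_tools, essential_tools):
    """Raise AssertionError naming every registered-but-unlisted tool."""
    missing = unlisted_tools(registered_tools, essential_tools)
    if missing:
        raise AssertionError(
            f"Registered MCP tools missing from whitelist: {missing}"
        )
```

Called from a test, check_whitelist fails the build with the names of the offending tools in the message, so the person adding a tool sees exactly what to whitelist.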

Step 4: Check Connection Resilience

An MCP server that starts successfully but has no crash recovery will eventually die: memory pressure, a bad request, or a downstream timeout can each take it down. In production, the question is not whether the server process will crash. It is when.

We use a wrapper script that monitors the MCP server process and restarts it automatically on failure. When our dual-scope bug caused agents to bypass this wrapper, servers that crashed stayed dead. The fix had two parts: ensure the wrapper is always used, and make the server's own shutdown path graceful. We replaced os._exit(1) calls in the memory watchdog with raise SystemExit(1), which allows Python's cleanup handlers to run and gives the wrapper a clean exit code to act on.
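
The difference between the two exit paths is easy to demonstrate in isolation. This sketch is a standalone illustration, not our watchdog code: it launches a small child script (the CHILD string and run_child helper are hypothetical) that registers an atexit handler and then exits one way or the other.

```python
import subprocess
import sys

# Child program: registers a cleanup handler, then exits hard or soft.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran", flush=True))
if sys.argv[1] == "hard":
    os._exit(1)        # terminates immediately, skipping atexit handlers
raise SystemExit(1)    # unwinds normally, so atexit handlers run
"""


def run_child(mode):
    """Run the child in a fresh interpreter; return (exit_code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout.strip()
```

In this demo both modes exit with status 1, but only the SystemExit path prints "cleanup ran". Anything the server relies on at shutdown, such as flushing logs or releasing locks, only happens on that path.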

Diagnostic:

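# Check whether your MCP server is running under a wrapper.
# The [m] keeps grep from matching itself; the PPID column shows the parent.
ps -eo pid,ppid,args | grep '[m]cp-server'
# Look for: wrapper.sh -> python server.py (good)
# vs: python server.py alone (no crash recovery)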

Fix: Always run production MCP servers behind a process supervisor. This can be a simple bash wrapper with a restart loop, systemd, or a container-level restart policy. Direct invocation is for development only.
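
For teams that prefer to keep the supervisor in Python, a restart loop can be sketched as follows. This is an illustration rather than our wrapper script: supervise and its parameters are hypothetical, and it assumes that exit code 0 means a deliberate shutdown that should not be restarted.

```python
import subprocess
import sys
import time


def supervise(argv, max_restarts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run argv, restarting it after non-zero exits with exponential backoff.

    Returns 0 after a clean exit, or the last failing exit code once
    max_restarts is exhausted.
    """
    restarts = 0
    while True:
        # The child inherits stdin/stdout, so a stdio transport passes through.
        code = subprocess.call(argv)
        if code == 0:
            return 0  # assumption: 0 means deliberate shutdown, do not restart
        if restarts >= max_restarts:
            return code
        delay = min(max_delay, base_delay * 2 ** restarts)
        print(
            f"server exited with {code}; restart {restarts + 1} in {delay:.0f}s",
            file=sys.stderr,
        )
        sleep(delay)
        restarts += 1
```

A call such as supervise([sys.executable, "server.py"]) would stand in for the direct invocation. The backoff and the restart cap keep a server that crashes on startup from spinning in a tight loop.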

Step 5: Check for Silent Connection Errors

The transport layer between the agent runtime and the MCP server uses stdio pipes or HTTP. Both can break silently. A BrokenPipeError means the server's stdout pipe closed, usually because the reading end (the agent runtime) terminated. A ConnectionResetError means the TCP connection was dropped by the peer.

Our MCP server was catching BrokenPipeError to handle graceful shutdown but was not catching ConnectionResetError. In certain network conditions, the reset variant would propagate as an unhandled exception and crash the server.

Diagnostic: Check your MCP server's error handling for transport-level exceptions:

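import logging

logger = logging.getLogger(__name__)


async def send_or_shutdown(result):
    # Both of these must be caught at the transport layer
    try:
        await send_response(result)  # your transport's write coroutine
    except (BrokenPipeError, ConnectionResetError):
        # Graceful shutdown, not a crash
        logger.info("Client disconnected, shutting down")
        raise SystemExit(0)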

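Fix: Catch both BrokenPipeError and ConnectionResetError at every point where your MCP server writes to the transport. Both are subclasses of ConnectionError, which is itself a subclass of OSError, so catching a parent class is an option if you also want to cover related failures. Log the event and exit cleanly so the wrapper can restart the server.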
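
One way to avoid repeating the handler at every write site is to route writes through a single guard. The sketch below is a synchronous illustration with hypothetical names (write_or_shutdown, and a write callable you supply); an async server would apply the same pattern around its awaited send.

```python
import logging

logger = logging.getLogger("mcp.transport")


def write_or_shutdown(write, payload):
    """Call write(payload); turn a dropped connection into a clean exit."""
    try:
        write(payload)
    except ConnectionError as exc:
        # ConnectionError is the built-in parent of BrokenPipeError,
        # ConnectionResetError, ConnectionAbortedError and
        # ConnectionRefusedError, so one clause covers all of them.
        logger.info("Client disconnected (%s), shutting down", type(exc).__name__)
        raise SystemExit(0) from exc
```

Catching ConnectionError rather than bare OSError is a deliberate middle ground in this sketch: it covers the disconnect family without also swallowing unrelated failures such as a full disk or a permissions error.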

Prevention Checklist

After fixing both bugs, we codified these rules to prevent recurrence:

  • Single-scope registration. Every MCP server is registered at exactly one scope. Startup scripts clean up entries from other scopes.
  • Wrapper-only invocation. Production MCP servers always run behind a crash-resilient wrapper. Direct Python invocation is blocked in production configs.
  • Whitelist audit on every tool addition. CI fails if a registered MCP tool is not in the essential tools set.
  • Comprehensive transport error handling. All pipe and connection errors are caught at the transport layer.
  • Graceful shutdown paths. No os._exit() calls. Always SystemExit so cleanup handlers run.
  • 65+ integration tests covering MCP registration, tool availability, and reconnection scenarios.

The Takeaway

MCP disconnections are insidious because they degrade an agent session without killing it. The agent keeps running, keeps spending tokens, keeps generating output that references tools it can no longer call. Every minute of a disconnected session is wasted compute.

The five-step checklist above covers the failure modes we have seen in production. The root causes are always mundane: a config conflict, a missing whitelist entry, an uncaught exception. But the symptoms are confusing enough that without a systematic approach, you will spend hours staring at logs that do not contain the answer.

If you are building multi-agent systems and want to see how we handle MCP resilience, tool management, and fleet operations at scale, visit agent.ceo and explore how GenBrain runs an entire AI-native organization with autonomous agents in production.
