Every agent in the GenBrain platform started throwing the same error:
MCP server agent-hub connection timed out after 20000ms
Not one agent. Not one org. Every agent across every customer organization, all at once. The CEO agent could not delegate. The CTO agent could not run deployments. Marketing, fullstack, DevOps -- all cut off from the tool layer that makes them useful.
The root cause was a single character: &.
Here is how we found it, and what else we fixed while we were deep in the deploy pipeline.
The Symptoms
The first sign was a cascade of NATS messages from agents reporting they could not reach their MCP servers. The error was consistent -- always a 20,000ms timeout on the agent-hub MCP connection -- but the behavior was intermittent. Some agents recovered after a restart. Others did not. One agent would connect fine, then its neighbor in the same namespace would fail.
Intermittent failures with a universal error message. That combination usually means the bug is in a shared path that races against something.
Ruling Out the Obvious
First check -- is the MCP server actually running? The process was there. Running. We could manually invoke it and get valid JSON-RPC responses. So the server was not crashing.
Second check -- network. MCP over stdio does not use the network, but we verified anyway. No firewall changes, no NetworkPolicy updates, no DNS issues.
Third check -- resource pressure. Maybe the pods were CPU-starved and the server was too slow to respond within the timeout window. But resource utilization was normal across the board.
The server works, the environment is fine, and the connection still times out. The bug had to be in the plumbing between the MCP client and the server process.
Finding the Ampersand
We pulled up the wrapper script that launches the MCP server -- start_agent_hub_mcp.sh. This is the script the MCP client invokes as a subprocess, expecting to communicate with it over stdin/stdout.
Here is what it looked like:
#!/bin/bash
python -m mcp_servers.agent_hub_mcp &
There it is. That trailing &.
When the MCP client spawns this script, it opens a pipe to the script's stdin/stdout and starts sending JSON-RPC messages. With stdio-based MCP, the contract is simple: the script's process IS the server. The client writes to its stdin, reads from its stdout.
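But & backgrounds the Python process. The shell script launches Python, returns immediately, and exits. The process the MCP client actually spawned is now gone, and the client is left holding pipes to a child it never started directly. The Python process is still running, but whether its inherited file descriptors are still usable depends on the shell's behavior, the OS's buffering, and timing. Bash, for instance, documents that when job control is not active (the normal case inside a script), a command started with & gets /dev/null as its default stdin, so the server may never see the client's requests at all.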
This explains the intermittent behavior perfectly. Sometimes the backgrounded process would attach to the inherited pipe fast enough that the client's first message would get through. Sometimes it would not, and the client would sit there for 20 seconds waiting for a response that would never come.
The Fix
Two characters removed, one keyword added:
#!/bin/bash
exec python -m mcp_servers.agent_hub_mcp
exec replaces the shell process with the Python process entirely. No forking, no backgrounding. The PID that owns the stdio pipe IS the MCP server. The client writes, the server reads, the contract is honored.
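The difference between the two wrappers can be reproduced outside the platform. The sketch below is a toy, not GenBrain's client or server: `SERVER_SRC` is a hypothetical stand-in that answers one line on stdin with its own PID, `try_wrapper` is an illustrative helper, and the wrappers are run with plain `sh` on a POSIX system. Because this toy client closes stdin after one message, it shows only the hard-failure end of the behavior described above, not the intermittent one.

```python
import json
import os
import shlex
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the MCP server: reply to one line with our own PID.
SERVER_SRC = '''
import json, os, sys
line = sys.stdin.readline()
if line:
    print(json.dumps({"pid": os.getpid(), "echo": line.strip()}), flush=True)
'''


def try_wrapper(launch_line):
    """Run a one-line wrapper over stdio pipes; return (wrapper PID, raw reply)."""
    with tempfile.TemporaryDirectory() as tmp:
        server = os.path.join(tmp, "server.py")
        wrapper = os.path.join(tmp, "wrapper.sh")
        with open(server, "w") as f:
            f.write(SERVER_SRC)
        cmd = f"{shlex.quote(sys.executable)} {shlex.quote(server)}"
        with open(wrapper, "w") as f:
            f.write("#!/bin/sh\n" + launch_line.format(cmd=cmd) + "\n")
        proc = subprocess.Popen(
            ["sh", wrapper], stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )
        # communicate() sends the request, closes stdin, and reads until EOF.
        reply, _ = proc.communicate(b"ping\n", timeout=30)
        return proc.pid, reply


bg_pid, bg_reply = try_wrapper("{cmd} &")       # the broken wrapper
ex_pid, ex_reply = try_wrapper("exec {cmd}")    # the fixed wrapper

print("backgrounded reply:", bg_reply)
print("exec reply:", ex_reply, "spawned PID:", ex_pid)
```

With a POSIX `sh`, the backgrounded run should come back empty, because the shell gives the background job /dev/null as stdin when job control is off. The exec run should return a reply whose `pid` equals the PID the client spawned, which is the "the PID that owns the pipe IS the server" property in concrete form.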
After deploying this change, MCP timeouts dropped to zero across the entire platform. Every agent, every org, instant recovery.
The Rule
If you are writing a wrapper script for a stdio-based MCP server, the process that speaks JSON-RPC must own the script's stdin and stdout directly. That means either:
- exec into the server process -- the wrapper replaces itself, so the server inherits the pipe.
- Run the server in the foreground -- the wrapper blocks until the server exits, keeping the pipe open.
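Never background it. Never let the wrapper exit while the server is still running. Once the shell the client spawned is gone, nothing guarantees the server is still on the other end of the client's pipe.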
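A rule like this is easy to check mechanically before a wrapper ships. The following is a minimal sketch of such a check, not something from the GenBrain pipeline: `backgrounded_commands` is a hypothetical helper, and its parsing is deliberately crude (it strips anything after a `#` and looks only at how each line ends), so treat it as a tripwire rather than a shell parser.

```python
def backgrounded_commands(script_text):
    """Return the lines of a shell script that end in a single '&'.

    Heuristic only: '#' inside a quoted string is misread as a comment,
    and multi-line constructs are not understood.
    """
    hits = []
    for line in script_text.splitlines():
        code = line.split("#", 1)[0].rstrip()  # drop comments and the shebang
        if code.endswith("&") and not code.endswith("&&"):
            hits.append(line)
    return hits


broken = "#!/bin/bash\npython -m mcp_servers.agent_hub_mcp &\n"
fixed = "#!/bin/bash\nexec python -m mcp_servers.agent_hub_mcp\n"

print(backgrounded_commands(broken))
print(backgrounded_commands(fixed))
```

Run against the two wrappers from this post, it flags the original launch line and returns an empty list for the exec version.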
While We Were in There: The 90-Second Redeploy Problem
Debugging the MCP timeout meant watching a lot of pod restarts. And watching pod restarts meant noticing something else: the CEO agent took 60 to 90 seconds to become ready on every single CI/CD roll.
The culprit was in the pod's security context. Without an explicit fsGroupChangePolicy, Kubernetes defaults to Always -- meaning on every pod start, it recursively walks the entire PVC and chowns every file to the specified group. The CEO agent's persistent volume had thousands of files: git repos, context caches, conversation history.
The fix was one line:
securityContext:
  fsGroup: 1000
  fsGroupChangePolicy: OnRootMismatch
OnRootMismatch tells Kubernetes to skip the recursive pass unless the ownership and permissions of the volume's root directory do not match what the pod expects. Redeploy time dropped from 60 to 90 seconds to under 5. If you run stateful agents on Kubernetes with PVCs, check your fsGroupChangePolicy.
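To see why the cost of the default scales with the number of files on the volume, here is a toy model of the two policies. It is not kubelet code: `paths_to_rewrite` is a hypothetical function, it compares only the group ID of the root directory (the real check also looks at permission bits), and it counts paths instead of changing anything.

```python
import os
import tempfile


def paths_to_rewrite(volume_root, fs_group, policy):
    """Toy model: how many paths an ownership pass would touch on pod start."""
    root_matches = os.stat(volume_root).st_gid == fs_group
    if policy == "OnRootMismatch" and root_matches:
        return 0  # root already belongs to the group: skip the recursive walk
    touched = 1  # the root directory itself
    for _dirpath, dirnames, filenames in os.walk(volume_root):
        touched += len(dirnames) + len(filenames)
    return touched


with tempfile.TemporaryDirectory() as vol:
    for i in range(200):
        open(os.path.join(vol, f"file{i}.txt"), "w").close()
    gid = os.stat(vol).st_gid

    always = paths_to_rewrite(vol, gid, "Always")
    on_mismatch = paths_to_rewrite(vol, gid, "OnRootMismatch")
    first_mount = paths_to_rewrite(vol, gid + 1, "OnRootMismatch")

print(always, on_mismatch, first_mount)
```

In this model, `Always` visits every path on every start, while `OnRootMismatch` pays that cost only when the root directory does not already match, such as the first time a volume is mounted with a new group.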
While We Were in There: Sidecar Convergence
The zero-downtime deploy pipeline we built last month had hardcoded git-sync as the only bundled sidecar. We had recently added cai-runtime as a second sidecar. Because it was not in the update function, every deploy triggered a double-roll: first the main container and git-sync would update, then cai-runtime would get manually patched, triggering a second rolling restart.
We generalized set_agent_images() to iterate over a BUNDLED_SIDECARS list instead of hardcoding a single name. Now when we add a third sidecar, it is a one-line addition to the list, not a forgotten edge case that causes double-rolls in production.
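The shape of that change can be sketched as follows. Only the names `set_agent_images` and `BUNDLED_SIDECARS` come from the real pipeline; the signature, the container dictionaries, and the image names below are assumptions made for illustration, and the real function presumably talks to the Kubernetes API instead of returning a list.

```python
# Sidecars that must be updated in the same patch as the main container.
BUNDLED_SIDECARS = ["git-sync", "cai-runtime"]


def set_agent_images(containers, main_container, images):
    """Return a copy of `containers` with the main container and every
    bundled sidecar moved to its new image in a single pass."""
    targets = [main_container, *BUNDLED_SIDECARS]
    return [
        {**c, "image": images[c["name"]]}
        if c["name"] in targets and c["name"] in images
        else c
        for c in containers
    ]


# Hypothetical pod spec fragment and target images.
pod_containers = [
    {"name": "agent", "image": "example/agent:1"},
    {"name": "git-sync", "image": "example/git-sync:1"},
    {"name": "cai-runtime", "image": "example/cai-runtime:1"},
    {"name": "log-shipper", "image": "example/log-shipper:1"},
]
new_images = {c["name"]: f"example/{c['name']}:2" for c in pod_containers}

patched = set_agent_images(pod_containers, "agent", new_images)
print([c["image"] for c in patched])
```

Because every bundled container changes in the same patch, the pod template is modified once and the rollout happens once. A container that is not in the list, like the made-up `log-shipper` here, is left alone.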
The Lesson
The most dangerous bugs are one character long.
An & that turns a foreground process into a background one. A missing fsGroupChangePolicy that defaults to a behavior you never chose. A hardcoded sidecar name that works fine until it does not.
These bugs survive code review because they look intentional. They survive testing because they work most of the time. They only reveal themselves at scale, under load, when the timing shifts just enough to break the assumption you did not know you were making.
The defense is the same as it always is: when something goes wrong, read the code character by character. The answer is usually smaller than you expect.
GenBrain runs a fleet of AI agents in production, with real deploy pipelines and real outages. agent.ceo is where we build it, break it, and write about both.