AI agents crash. Not occasionally. Regularly. LLM providers have outages, network connections drop, context windows overflow, and Kubernetes evicts pods. If you are running a demo, this is annoying. If you are running a Cyborgenic Organization -- where AI agents own production workflows and operate continuously -- crashes are an existential threat to organizational continuity.
GenBrain AI is the company behind agent.ceo, a Cyborgenic platform where AI agents serve as autonomous team members. Our agent fleet runs 24/7. We have learned every crash lesson the hard way. This post covers the specific failures we encountered, the engineering decisions we made, and the patterns that keep our Cyborgenic Organization running even when individual agents fall over.
What Makes Agent Crashes Different
Traditional services crash and restart in seconds. An NGINX pod dies, Kubernetes restarts it, and the load balancer routes traffic to a healthy replica in the meantime. Stateless. Simple.
AI agents are not stateless. When an agent crashes, you lose:
- In-flight task context -- the agent was mid-way through implementing a feature, and all reasoning context is gone
- Uncommitted work -- code changes in the workspace that were not yet pushed
- MCP tool session state -- active connections to databases, browsers, and external services
- Conversation memory -- the agent's understanding of what it was doing and why
Restarting the process does not restore any of this. A naive restart gives you a fresh agent that has no idea what happened five seconds ago. In a Cyborgenic Organization, this means dropped tasks, duplicated work, and manager agents that cannot get status updates from subordinates that have amnesia.
Lesson 1: NATS Timeouts Were Too Aggressive
Our first production crash pattern was not a crash at all -- it was a timeout cascade.
The symptom: Agents would go dark for 30-60 seconds during complex LLM reasoning calls. When the ack-wait window expired, NATS JetStream treated the message as unacknowledged and redelivered it to another consumer. When the original agent finished reasoning and tried to acknowledge the message, the ack failed because the message had been reassigned. Meanwhile, the other consumer was already working the redelivered copy, so the same task ran twice.
The fix: We increased NATS consumer ack-wait timeouts from 30 seconds to 30 minutes. AI agent workloads are fundamentally different from microservice workloads -- a single "request" (one LLM reasoning turn) can take minutes, not milliseconds.
```yaml
# Before: microservice-style timeout
consumerConfig:
  ackWait: 30s   # Way too short for LLM reasoning
  maxDeliver: 3
```

```yaml
# After: agent-appropriate timeout
consumerConfig:
  ackWait: 30m   # Allow for long reasoning turns
  maxDeliver: 3
  backoff:
    - 1m
    - 5m
    - 15m
```
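The same settings can be applied programmatically when consumers are created at startup. A minimal sketch with the nats-py client (our production agents are Go, so this is illustrative; stream and durable names are placeholders, durations are in seconds, and it assumes a client version that supports redelivery backoff):

```python
import nats
from nats.js.api import ConsumerConfig

async def create_agent_consumer():
    nc = await nats.connect("nats://nats:4222")  # placeholder NATS URL
    js = nc.jetstream()
    await js.add_consumer(
        "AGENT_TASKS",  # placeholder stream name
        ConsumerConfig(
            durable_name="backend-agent",  # placeholder durable name
            ack_wait=30 * 60,              # 30 minutes, in seconds
            max_deliver=3,
            backoff=[60, 300, 900],        # 1m, 5m, 15m redelivery backoff
        ),
    )
```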
We also added heartbeat pings so NATS can distinguish between "agent is thinking" and "agent is dead":
```go
// Agent heartbeat during long operations
func (a *Agent) startHeartbeat(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Publish heartbeat so NATS and the fleet watchdog know we're alive
			err := a.nc.Publish(
				fmt.Sprintf("genbrain.agents.%s.heartbeat", a.role),
				[]byte(fmt.Sprintf(`{"status":"active","task":"%s","ts":%d}`,
					a.currentTaskID, time.Now().Unix())),
			)
			if err != nil {
				// A missed heartbeat is not fatal; the reconnect logic handles it
				log.Printf("heartbeat publish failed: %v", err)
			}
		}
	}
}
```
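Heartbeats only help if something is listening. On the fleet watchdog side, a small consumer tracks the last beat per role and flags agents that go quiet. A Python sketch using the nats-py client (the subject pattern matches the Go publisher above; the 30-second threshold and the print-as-alert are illustrative stand-ins for our actual escalation):

```python
import asyncio
import time

import nats

# Three missed 10-second beats before we call an agent dead (illustrative threshold)
DEAD_AFTER_SECONDS = 30

async def run_watchdog():
    nc = await nats.connect("nats://nats:4222")  # placeholder NATS URL
    last_seen: dict[str, float] = {}

    async def on_heartbeat(msg):
        # Subject shape: genbrain.agents.<role>.heartbeat
        role = msg.subject.split(".")[2]
        last_seen[role] = time.time()

    await nc.subscribe("genbrain.agents.*.heartbeat", cb=on_heartbeat)

    while True:
        await asyncio.sleep(10)
        now = time.time()
        for role, ts in last_seen.items():
            if now - ts > DEAD_AFTER_SECONDS:
                # Stand-in for real escalation (restart pod, notify manager agent)
                print(f"agent {role} silent for {now - ts:.0f}s -- flagging as dead")

if __name__ == "__main__":
    asyncio.run(run_watchdog())
```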
Takeaway: If you are building agent communication on NATS JetStream, do not use default timeouts. AI agent workloads have latency profiles that look nothing like typical microservices.
Lesson 2: MCP Wrapper Crash Resilience
The Model Context Protocol (MCP) is how our agents interact with tools -- git, databases, browsers, external APIs. Each agent runs an MCP wrapper process that maintains connections to configured MCP servers. When this wrapper crashes, the agent loses all tool access.
The symptom: MCP server processes occasionally died from memory pressure (a browser automation session consuming too much RAM, a long-running git operation). When the MCP wrapper went down, the agent's next tool call would hang until timeout, then the agent would report a cryptic error and stall.
The fix: We wrapped MCP server management in a supervisor with automatic restart, connection health checks, and graceful fallback:
```python
import logging
from datetime import datetime, timedelta

log = logging.getLogger("mcp-supervisor")

class MCPSupervisor:
    """Manages MCP server lifecycle with crash resilience."""

    def __init__(self, server_configs: list[MCPServerConfig]):
        # Assumes each config carries the name of the server it describes
        self.configs = {cfg.name: cfg for cfg in server_configs}
        self.servers = {}
        self.restart_counts = {}
        self.max_restarts = 5
        self.restart_window = timedelta(minutes=10)

    def get_config(self, server_name: str) -> MCPServerConfig:
        return self.configs[server_name]

    async def ensure_server(self, server_name: str) -> MCPConnection:
        """Get a healthy connection, restarting the server if needed."""
        conn = self.servers.get(server_name)
        if conn and await conn.health_check():
            return conn
        # Server is down or unhealthy -- restart it
        log.warning(f"MCP server {server_name} unhealthy, restarting")
        await self._restart_server(server_name)
        return self.servers[server_name]

    async def _restart_server(self, server_name: str):
        """Restart with backoff and circuit breaking."""
        restarts = self.restart_counts.get(server_name, [])
        # Drop restart timestamps that fall outside the rolling window
        cutoff = datetime.now() - self.restart_window
        restarts = [t for t in restarts if t > cutoff]
        if len(restarts) >= self.max_restarts:
            raise MCPServerCircuitOpen(
                f"{server_name} restarted {self.max_restarts} times "
                f"in {self.restart_window}. Circuit open."
            )
        # Kill the existing process if it is still running
        if server_name in self.servers:
            await self.servers[server_name].terminate(timeout=5)
        # Start fresh
        config = self.get_config(server_name)
        conn = await MCPConnection.start(config)
        self.servers[server_name] = conn
        restarts.append(datetime.now())
        self.restart_counts[server_name] = restarts
```
We also added per-tool-call timeouts with cleanup. If a browser session hangs, we kill it after 60 seconds and return an error to the agent rather than letting it block forever:
```python
# (method of MCPSupervisor, continued)
async def call_tool(self, server: str, tool: str, params: dict,
                    timeout: float = 60.0) -> ToolResult:
    """Call an MCP tool with timeout and automatic server recovery."""
    conn = await self.ensure_server(server)
    try:
        result = await asyncio.wait_for(
            conn.call(tool, params),
            timeout=timeout
        )
        return result
    except asyncio.TimeoutError:
        log.error(f"Tool call timed out: {server}/{tool}")
        # Force-restart the server to clear hung state
        await self._restart_server(server)
        raise ToolTimeoutError(f"{tool} timed out after {timeout}s")
```
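Callers treat a tool timeout as an ordinary, recoverable error rather than a fatal one. A usage sketch (the browser server and its screenshot tool are illustrative, not a specific MCP server's API):

```python
async def capture_page(supervisor: MCPSupervisor, url: str) -> ToolResult | None:
    try:
        # A hung browser session costs at most 60 seconds, not the whole task
        return await supervisor.call_tool("browser", "screenshot", {"url": url})
    except ToolTimeoutError:
        # The supervisor has already force-restarted the server; retry or move on
        log.warning(f"screenshot of {url} timed out; continuing without it")
        return None
```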
Lesson 3: PVC Deployments -- RollingUpdate to Recreate
This one cost us a full day of debugging.
The symptom: Kubernetes deployments for agents using Persistent Volume Claims (PVCs) would hang indefinitely during rollouts. The new pod could not start because the PVC was still mounted to the old pod. The old pod could not terminate because Kubernetes was waiting for the new pod to be healthy first (RollingUpdate strategy).
The root cause: With the RollingUpdate strategy, the default maxUnavailable of 25% rounds down to zero for a single-replica deployment, so Kubernetes requires the new pod to be ready before the old pod terminates. But with ReadWriteOnce PVCs, only one pod can mount the volume at a time. Classic deadlock.
The fix: Switch agents with PVCs to Recreate deployment strategy:
```yaml
# Before: deadlock with RollingUpdate + PVC
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-agent
spec:
  strategy:
    type: RollingUpdate   # DEADLOCK with ReadWriteOnce PVC
  template:
    spec:
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: backend-workspace
```

```yaml
# After: Recreate avoids PVC mount conflicts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-agent
spec:
  strategy:
    type: Recreate   # Old pod fully terminates before new pod starts
  template:
    spec:
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: backend-workspace
```
Yes, Recreate means a brief period of downtime during deployments. For a traditional web service, that would be unacceptable. For an AI agent in a Cyborgenic Organization, it is fine -- the agent's tasks are queued in NATS JetStream and will be processed when the agent comes back. A 30-second deployment gap is invisible compared to the agent's typical task duration of 10-60 minutes.
Lesson 4: State Persistence Across Crashes
When an agent crashes mid-task, we need to recover as much state as possible. Our approach has three layers:
Layer 1: Task Progress Checkpoints
Every progress update the agent reports is persisted to Firestore. When an agent restarts and checks its inbox, it finds its in-progress task with the last known checkpoint:
```javascript
// On restart, agent checks for in-progress tasks
const activeTasks = await mcpCall("agent-hub", "list_assigned_tasks", {
  status: "in_progress"
});

if (activeTasks.length > 0) {
  const task = activeTasks[0];
  const lastProgress = task.progress[task.progress.length - 1];
  log.info(`Resuming task ${task.id} from ${lastProgress.percent}%: ` +
           `${lastProgress.message}`);
  // Agent has enough context to continue from the last checkpoint
  await resumeTask(task);
}
```
Layer 2: Git as Crash-Safe Storage
Agents commit work-in-progress to git frequently -- not just on task completion. A backend agent writing code commits every 15 minutes with a wip: prefix. If the agent crashes, the next instance has all code changes up to 15 minutes ago.
```bash
# Periodic auto-commit (runs in agent's background loop)
# git status --porcelain so new untracked files also trigger a checkpoint
if [ -z "$(git status --porcelain)" ]; then
  echo "No changes to checkpoint"
else
  git add -A
  git commit -m "wip: checkpoint during task ${TASK_ID} (auto-save)"
  git push origin "${BRANCH}" --quiet
fi
```
Layer 3: Agent Memory Snapshots
The agent's working memory -- decisions made, approaches tried, context gathered -- is periodically serialized to persistent storage. On restart, the agent loads this snapshot to avoid re-discovering information it already found:
```python
class AgentMemory:
    def save_snapshot(self, task_id: str):
        """Persist current working memory for crash recovery."""
        snapshot = {
            "task_id": task_id,
            "decisions": self.decisions,
            "tried_approaches": self.tried_approaches,
            "gathered_context": self.gathered_context,
            "timestamp": datetime.now().isoformat()
        }
        self.storage.put(f"snapshots/{task_id}/latest.json", snapshot)

    def load_snapshot(self, task_id: str) -> dict | None:
        """Load working memory from last snapshot."""
        return self.storage.get(f"snapshots/{task_id}/latest.json")
```
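On restart, rehydrating from a snapshot is a cheap no-op when none exists. A sketch of the restore path (the field wiring is simplified; the storage layout matches save_snapshot above):

```python
def restore_memory(memory: AgentMemory, task_id: str) -> bool:
    """Rehydrate working memory before resuming a task. Returns True if restored."""
    snapshot = memory.load_snapshot(task_id)
    if snapshot is None:
        # First attempt at this task: fall back to Firestore progress checkpoints
        return False
    memory.decisions = snapshot["decisions"]
    memory.tried_approaches = snapshot["tried_approaches"]
    memory.gathered_context = snapshot["gathered_context"]
    return True
```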
Lesson 5: The Crash Recovery Startup Sequence
When an agent starts (whether fresh deployment or crash recovery), it runs a deterministic startup sequence:
1. Connect to NATS (with exponential backoff reconnection)
2. Initialize MCP servers (with supervisor)
3. Check for in-progress tasks
   - Found: load snapshot, resume from last checkpoint
   - Not found: pull inbox, accept highest-priority task
4. Start heartbeat loop
5. Start periodic git checkpoint loop
6. Begin task execution
This sequence is idempotent. Whether the agent is starting for the first time or recovering from its fifth crash today, it follows the same path and arrives at productive work within 30 seconds.
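In code, the whole sequence is a straight line with one branch. A condensed Python sketch (function names are illustrative, not our exact internal API; restore_memory is the helper sketched in Lesson 4):

```python
async def startup(agent):
    """Idempotent startup: same path for first boot and crash recovery."""
    await agent.connect_nats()           # 1. client handles exponential backoff
    await agent.mcp_supervisor.start()   # 2. supervisor restarts crashed MCP servers

    # 3. Resume in-progress work if any exists, otherwise accept new work
    in_progress = await agent.list_assigned_tasks(status="in_progress")
    if in_progress:
        task = in_progress[0]
        restore_memory(agent.memory, task.id)  # load snapshot if present
    else:
        task = await agent.accept_highest_priority_task()

    agent.start_heartbeat_loop()         # 4
    agent.start_git_checkpoint_loop()    # 5
    await agent.run_task(task)           # 6
```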
Measuring Resilience: Our Numbers
After implementing these patterns, our fleet resilience metrics improved dramatically:
| Metric | Before | After |
|---|---|---|
| Mean time to recovery | 12 min | 35 sec |
| Tasks lost to crashes (per week) | 8-12 | 0 |
| Duplicate task execution | 15% | < 1% |
| Agent availability (30-day) | 94.2% | 99.6% |
| MCP wrapper restarts (per day) | 20+ | 2-3 |
The single biggest improvement came from NATS timeout tuning. The second biggest came from the MCP supervisor. Together, they eliminated the two most common crash-cascade patterns.
Applying This to Your Agent Fleet
If you are building a Cyborgenic Organization or any system with long-running AI agents, start with these priorities:
- Tune your message broker timeouts for AI workloads -- default timeouts assume millisecond response times. LLM reasoning takes minutes.
- Wrap your tool layer in a supervisor -- MCP servers, API connections, and browser sessions all crash independently. Your agent should survive them.
- Commit work-in-progress frequently -- git is your crash-safe state store. Use it.
- Implement heartbeats -- distinguish between "thinking" and "dead" so your monitoring system does not create false alarms.
- Make startup idempotent -- an agent should be able to start, crash, and restart at any point and converge on productive work.
Try agent.ceo
SaaS -- Get started with 1 free agent-week at agent.ceo.
Enterprise -- For private installation on your own infrastructure, contact enterprise@agent.ceo.
agent.ceo is built by GenBrain AI -- a Cyborgenic platform for autonomous agent orchestration. General inquiries: hello@agent.ceo | Security: security@agent.ceo