Product · 16 min read

Self-Healing Infrastructure: The Invisible Systems That Keep AI Agents Running

Moshe Beeri, Founder
Tags: agents, infrastructure, seo, mcp, crash-recovery, kubernetes, automation, cyborgenic-organization

TL;DR

  • Automated SEO sitemap submission (GitHub Actions + K8s CronJob fallback) eliminates silent indexing failures.
  • MCP crash recovery with exponential backoff keeps agents' tools alive through transient failures -- about 50 lines of shell script separating a 3-second blip from hours of lost work.
  • Self-healing is what turns an AI demo into a cyborgenic organization that actually runs.

The best infrastructure is invisible. DNS resolution. TLS certificate renewal. Log rotation. These systems run continuously, fail occasionally, recover automatically, and only surface when the self-healing itself breaks.

In a Cyborgenic Organization, self-healing infrastructure is what makes autonomy real rather than aspirational. Agents that depend on human intervention to recover from routine failures are not autonomous teammates -- they are high-maintenance tools with a timer running until the next babysitting session.

I run 11 AI agents in production -- CEO, CTO, DevOps, Fullstack, Marketing, Architect, CFO, CSO, Investment, Org-Agent, and ZiDevops-Director. When I am asleep and the MCP server crashes at 3 AM, there is no one to restart it. The infrastructure must maintain itself. This cycle we shipped two systems that embody that principle: automated SEO sitemap submission and MCP server crash recovery. Neither is glamorous. Neither will show up in a demo. Both solve the same structural problem: things that work fine when a human is watching and fail silently when nobody is.

Our codebase has 83,163 test functions across 2,304 test files, built up over 9,799 git commits. The self-healing code is some of the most important code we have, and some of the shortest. Here is what we built, how it works, and why self-healing patterns are the difference between a demo and a production system.

SEO Sitemap Automation

The problem nobody noticed

Here is a failure mode that does not trigger alerts, does not crash pods, and does not show up in any dashboard: your sitemap goes stale.

Every time we deploy new content to the agent.ceo marketing site — a blog post, a landing page, an updated product description — search engines need to know about it. The mechanism is sitemap submission: you generate an XML sitemap listing every URL on your site, then you tell Google Search Console "here is the updated map, come re-crawl."

For months, this was a manual step. Someone would remember to submit the sitemap after a deploy. Sometimes. When they did not, new content sat unindexed for days or weeks. The blog post was live. The URL worked. But Google did not know it existed, so organic search traffic to that page was zero until the next scheduled crawl happened to pick it up.

This is the kind of failure that compounds. One missed submission is invisible. Twenty missed submissions over two months means your site's search presence is perpetually stale. You cannot fix it with a one-time script because the problem recurs with every deploy.

The fix: two layers of automation

We built two independent submission paths. Either one is sufficient. Together, they make missed submissions nearly impossible.

Layer 1: GitHub Actions reusable workflow. The file is sitemap-submit.yml, designed as a reusable workflow that any repository can call from its post-deploy step. When the marketing site deploys, the workflow triggers automatically: it fetches the generated sitemap URL, authenticates to the Google Search Console API using the agent-deployer service account, and submits the sitemap.

The reusable workflow design matters. We did not hardcode this into one repository's CI pipeline. Any site we deploy — the main marketing site, the docs site, the blog — can call the same workflow with its own sitemap URL. One implementation, multiple consumers. When we improve the submission logic (retries, error reporting, multi-engine support), every consumer gets the improvement for free.

Layer 2: Kubernetes CronJob at 06:00 UTC. The GitHub Actions workflow covers the deploy-triggered case: new content goes live, sitemap gets submitted immediately. But what about content changes that do not involve a deploy? What about the case where the Actions workflow fails silently because of a transient API error?

A K8s CronJob runs daily at 06:00 UTC as the fallback. It submits the sitemap regardless of whether a deploy happened. Same service account, same API call, same authentication path. If the Actions workflow already submitted it, the duplicate submission is harmless — Google just re-processes the same sitemap. If the Actions workflow missed it, the CronJob catches it within 24 hours.

Both paths authenticate through the agent-deployer service account, which has the minimum required permissions for Search Console API access. No human credentials. No OAuth tokens that expire and require manual refresh. The service account key lives in a Kubernetes secret, rotated on schedule, and both the Actions workflow and the CronJob consume it identically.
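For illustration, here is a minimal Python sketch of the kind of call both paths could make. It is not the contents of sitemap-submit.yml or the CronJob. It targets the Search Console API's sitemaps.submit endpoint using only the standard library. The retry policy, the function names, and the way the bearer token is minted from the service account key are all assumptions; the token is simply passed in by the caller.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = "https://www.googleapis.com/webmasters/v3"


def build_submit_url(site_url: str, sitemap_url: str) -> str:
    """Build the sitemaps.submit URL; both arguments are percent-encoded whole."""
    site = urllib.parse.quote(site_url, safe="")
    feed = urllib.parse.quote(sitemap_url, safe="")
    return f"{API_ROOT}/sites/{site}/sitemaps/{feed}"


def submit_sitemap(site_url, sitemap_url, access_token,
                   send=urllib.request.urlopen, attempts=3, sleep=time.sleep):
    """PUT the sitemap, retrying server-side (5xx) and network errors.

    `send` and `sleep` are injectable so the logic can be tested offline.
    """
    request = urllib.request.Request(
        build_submit_url(site_url, sitemap_url),
        method="PUT",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    for attempt in range(1, attempts + 1):
        try:
            with send(request) as response:
                return response.status
        except urllib.error.HTTPError as err:
            # 4xx means the request itself is wrong; retrying will not help.
            if err.code < 500 or attempt == attempts:
                raise
        except urllib.error.URLError:
            if attempt == attempts:
                raise
        sleep(2 ** attempt)  # 2s, then 4s, between attempts
```

Because resubmitting the same sitemap is harmless, the same function can be called from the deploy-triggered path and from the daily fallback without any coordination between them.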

Search Console MCP tools

Submitting sitemaps is one direction: pushing information to Google. We also built the reverse: agents can query Google Search Console to check indexing status.

The Search Console MCP tools let any agent in the fleet check which pages are indexed, which have errors, and which are pending. The marketing agent uses this during content audits. Instead of logging into Search Console manually and eyeballing the coverage report, the agent queries the API directly, identifies pages with indexing issues, and either fixes the problem (if it is a content issue) or flags it for infrastructure review (if it is a crawl error).

This closes the loop. Automated submission ensures content reaches Google. MCP tools ensure agents can verify it arrived. No human in the middle for either direction.
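The triage step can be sketched in a few lines of Python. The record shape and the issue labels below are hypothetical, chosen only to illustrate the fix-or-flag decision; they are not the Search Console API's response format or the MCP tools' actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStatus:
    url: str
    status: str                   # hypothetical: "indexed", "pending", or "error"
    reason: Optional[str] = None  # hypothetical label, set when status == "error"


# Hypothetical labels for problems the marketing agent can fix in the content itself.
CONTENT_ISSUES = {"noindex_tag", "duplicate_content", "thin_content"}


def triage(pages):
    """Split errored pages into content fixes and infrastructure flags."""
    fix_content, flag_infra = [], []
    for page in pages:
        if page.status != "error":
            continue  # indexed and pending pages need no action yet
        if page.reason in CONTENT_ISSUES:
            fix_content.append(page.url)
        else:
            flag_infra.append(page.url)  # e.g. crawl errors
    return fix_content, flag_infra
```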

MCP Crash Recovery

The failure mode that kills agent sessions

MCP (Model Context Protocol) servers are how our agents access tools. Every capability an agent has — reading its inbox, updating a task, querying the discovery engine, sending a message — routes through an MCP server. If the MCP server goes down, the agent loses all of its tools. Not some tools. All tools. The agent can still reason, but it cannot act. It becomes an expensive process that thinks very hard about problems it cannot solve.

Our previous implementation was a bare exec call. The wrapper script replaced itself with the MCP server process, so nothing was left running to supervise it. If the process crashed -- segfault, unhandled exception, OOM-kill, anything -- it was gone. No restart. No recovery. The agent session would eventually fail when every tool call returned an error, and someone would have to investigate why the marketing agent had been sitting idle for three hours.

The OOM-kill case was especially painful. MCP servers are long-running Python processes that accumulate state. So I built a memory watchdog that runs as a background asyncio task inside every MCP server process. Here is the real code from conductor/src/mcp_servers/memory_watchdog.py:

# From conductor/src/mcp_servers/memory_watchdog.py -- real production code
# (imports and logger setup added so the excerpt runs on its own)
"""Memory watchdog -- monitors RSS and forces a restart at threshold.

Designed to run as a background asyncio task inside the MCP server process.
When memory exceeds the configured limit, it logs a critical message and
calls os._exit(1) so the crash-recovery loop can restart the process cleanly.
"""

import asyncio
import logging
import os
import resource

logger = logging.getLogger(__name__)

RSS_WARN_MB = int(os.environ.get("MCP_RSS_WARN_MB", "100"))
RSS_LIMIT_MB = int(os.environ.get("MCP_RSS_LIMIT_MB", "140"))
CHECK_INTERVAL_S = int(os.environ.get("MCP_MEMORY_CHECK_INTERVAL", "30"))


def get_rss_mb() -> float:
    """Return current RSS in megabytes (Linux: reads /proc/self/status)."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    # VmRSS is reported in kB
                    return int(line.split()[1]) / 1024.0
    except (OSError, ValueError):
        pass
    # Fallback when /proc is unavailable. ru_maxrss is the peak RSS, not the
    # current one, and is in KB on Linux but bytes on macOS.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_maxrss / 1024.0


async def memory_watchdog(
    warn_mb: int = RSS_WARN_MB,
    limit_mb: int = RSS_LIMIT_MB,
    interval: int = CHECK_INTERVAL_S,
) -> None:
    """Background task that monitors memory and exits when over limit."""
    warned = False
    while True:
        await asyncio.sleep(interval)
        rss = get_rss_mb()
        if rss >= limit_mb:
            logger.critical(
                "MCP server RSS %.1fMB exceeds limit %dMB -- forcing restart",
                rss, limit_mb,
            )
            # Hard exit: skips finalizers and atexit handlers so shutdown cannot hang.
            os._exit(1)
        elif rss >= warn_mb and not warned:
            logger.warning("MCP server RSS %.1fMB approaching limit %dMB", rss, limit_mb)
            warned = True
        elif rss < warn_mb:
            # Re-arm the warning once memory drops back below the threshold.
            warned = False

That os._exit(1) is deliberate. I do not want a graceful Python shutdown that might hang on finalizers. I want the process dead immediately so the crash-recovery loop can restart it in a known-clean state. With the default 30-second check interval, the warn-at-100MB, kill-at-140MB thresholds usually leave about 30 seconds between the logged warning and the restart -- enough time to finish a tool call, not enough time to accumulate more garbage. The exception is a spike that jumps from below 100MB to above 140MB within a single interval, which is killed without a prior warning.
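How a watchdog like this gets attached to a server is not shown in the excerpt, so the following is only a sketch of one plausible wiring. The names watchdog and serve_with_watchdog are hypothetical. The RSS reader and the over-limit action are injected, which lets the threshold logic be exercised in a test without actually killing the process; in production the action would be something like lambda rss: os._exit(1).

```python
import asyncio


async def watchdog(read_rss_mb, on_limit, limit_mb=140.0, interval=30.0):
    """Poll RSS every `interval` seconds; call on_limit(rss) once at the limit."""
    while True:
        await asyncio.sleep(interval)
        rss = read_rss_mb()
        if rss >= limit_mb:
            on_limit(rss)
            return


async def serve_with_watchdog(serve, read_rss_mb, on_limit, **watchdog_kwargs):
    """Run the server coroutine with the watchdog as a background task."""
    task = asyncio.create_task(watchdog(read_rss_mb, on_limit, **watchdog_kwargs))
    try:
        await serve()
    finally:
        task.cancel()  # stop polling once the server itself has returned
```

Running the watchdog as a task on the same event loop means it only fires while the loop is responsive; a server blocked in synchronous code would not be checked until it yields.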

Crashes happened more often than you would expect. MCP servers are long-running processes that handle JSON-RPC over stdio. They accumulate state. They interact with external APIs that can return unexpected responses. They are not inherently fragile, but any process that runs for hours will eventually encounter a condition its author did not anticipate. The question is not whether it will crash, but what happens when it does.

The fix: auto-restart with exponential backoff

The new implementation wraps the MCP server in a restart loop with exponential backoff, failure tracking, and signal handling. Here is the design.

Restart loop. When the MCP server exits unexpectedly, the wrapper restarts it automatically. The agent session continues. From the agent's perspective, there is a brief window — typically under five seconds — where tool calls fail. Then the server is back, and tools work again. The agent retries the failed call and keeps working. In most cases, the agent does not even notice the interruption.

Exponential backoff. Naive restart loops are dangerous. If the MCP server crashes because of a persistent condition (corrupted state file, misconfigured environment variable, incompatible dependency), restarting it immediately will just crash it again. And again. And again. Hundreds of times per minute, filling logs and consuming CPU.

The backoff starts at 2 seconds and doubles on each consecutive failure, capping at 60 seconds. First crash: wait 2 seconds, restart. Second crash: wait 4 seconds. Third: 8 seconds. This gives transient issues time to resolve (a brief network partition, a momentary memory spike) while preventing runaway restart storms for persistent failures.
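The schedule is easy to check with a few lines of Python. This is a sketch of the arithmetic only, using the 2-second base and 60-second cap stated above; the function name is hypothetical and the production logic lives in the shell wrapper.

```python
def backoff_delay(consecutive_failures: int, base: int = 2, cap: int = 60) -> int:
    """Seconds to wait before the next restart: base * 2^(n-1), capped."""
    if consecutive_failures < 1:
        return 0
    # Clamp the exponent so a very long failure streak cannot build a huge integer.
    exponent = min(consecutive_failures - 1, 30)
    return min(cap, base * 2 ** exponent)


print([backoff_delay(n) for n in range(1, 8)])  # [2, 4, 8, 16, 32, 60, 60]
```

The cap is reached on the sixth consecutive failure, where the uncapped delay would be 64 seconds.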

Failure tracking. The restart loop tracks consecutive failures using shared state in common.sh. After 10 consecutive failures without a successful startup, the system enters an extended backoff period. At that point, the issue is almost certainly not transient — something is structurally wrong, and hammering the restart loop will not fix it. The extended backoff reduces system load while preserving the ability to recover if the underlying issue resolves (for example, if a dependent service comes back online).

Signal handling. The restart loop handles SIGTERM and SIGINT for graceful shutdown. When Kubernetes sends SIGTERM to drain a pod, the wrapper catches it, forwards it to the MCP server process, waits for clean exit, and then exits itself. Without this, pod termination during an MCP restart window could leave orphan processes or corrupt state files.

Structured logging. Every restart event -- the crash, the backoff duration, the restart attempt, the success or failure of the new process -- logs to /agent-data/logs/mcp_server.log with timestamps and severity levels. When something does go wrong enough to require human investigation, the log tells the full story: when the crashes started, how often, what the backoff progression looked like, and whether recovery succeeded.
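The pieces above fit together roughly as follows. This is a Python analogue written for illustration, not the shell wrapper itself. The 2-second base, 60-second cap, and 10-failure threshold come from the description above. The extended backoff duration (EXTENDED_S) and the uptime that counts as a successful startup (HEALTHY_S) are not stated in this post, so both values are assumptions, as are the function and parameter names.

```python
import signal
import subprocess
import time

BASE_S, CAP_S = 2, 60   # from the description above
MAX_FAILURES = 10       # from the description above
EXTENDED_S = 300        # assumption: the extended backoff duration is not stated
HEALTHY_S = 30          # assumption: uptime that counts as a successful startup


def supervise(cmd, sleep=time.sleep, log=print, max_restarts=None, handle_signals=True):
    """Run cmd, restarting it with backoff until it exits cleanly or is told to stop."""
    state = {"child": None, "stopping": False}

    def forward(signum, _frame):
        # Forward SIGTERM/SIGINT to the child, then let the loop wind down.
        state["stopping"] = True
        child = state["child"]
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    if handle_signals:  # signal handlers can only be installed from the main thread
        signal.signal(signal.SIGTERM, forward)
        signal.signal(signal.SIGINT, forward)

    failures = restarts = 0
    code = 0
    while not state["stopping"]:
        started = time.monotonic()
        state["child"] = subprocess.Popen(cmd)
        code = state["child"].wait()
        if state["stopping"] or code == 0:
            break
        if time.monotonic() - started >= HEALTHY_S:
            failures = 0  # it ran long enough to count as a successful startup
        failures += 1
        if max_restarts is not None and restarts >= max_restarts:
            break
        if failures >= MAX_FAILURES:
            delay = EXTENDED_S
        else:
            delay = min(CAP_S, BASE_S * 2 ** (failures - 1))
        log(f"[WARN] MCP server exited with code {code}. "
            f"Restart attempt {failures}. Backoff: {delay}s.")
        sleep(delay)
        restarts += 1
    return code
```

One simplification worth noting: a signal that arrives during the backoff sleep is only acted on once the sleep finishes, so shutdown can lag by up to the current delay. A production wrapper would want an interruptible wait.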

The environment configuration that drives this comes from agent_hub_mcp.py -- the same 190-function file that powers all 11 agents' MCP tools:

# From conductor/src/mcp_servers/agent_hub_mcp.py -- real production config
import os

NATS_URL = os.environ.get("NATS_URL", "nats://nats:4222")
OPERATOR_ID = os.environ.get("OPERATOR_ID", "default-operator")
ROLE_ID = os.environ.get("ROLE_ID", "agent")
AGENT_REGISTRY_URL = os.environ.get("AGENT_REGISTRY_URL", "http://agent-registry:8002")
MCP_REGISTRY_URL = os.environ.get("MCP_REGISTRY_URL", "http://mcp-registry:8001")

Every MCP server restart reconnects to NATS on 4222, re-registers with the MCP Registry on 8001, and re-announces to the Agent Registry on 8002. The recovery is not just "process is alive again" -- it is "process is alive and fully wired into the fleet."

What this looks like in practice

Here is a real scenario. The MCP server hits an unhandled exception while processing a malformed JSON-RPC request from a tool call. The process exits with code 1.

  1. The wrapper detects the exit. Logs: [WARN] MCP server exited with code 1. Restart attempt 1. Backoff: 2s.
  2. Waits 2 seconds. Restarts the server.
  3. The server starts successfully. The agent's next tool call works. Logs: [INFO] MCP server recovered after 1 restart.
  4. Total downtime: approximately 3 seconds.
  5. The agent retried one tool call. It did not lose context. It did not lose its task. It kept working.

Without the restart loop, that same scenario ends with the agent losing all tool access for the remainder of its session. The session eventually times out or gets killed by SLA enforcement. The task gets reassigned. Hours of accumulated context are lost. And someone has to figure out why.

The difference between these two outcomes is about 50 lines of shell script.

Why Self-Healing Matters More Than Features

There is a pattern in how agent infrastructure matures. The first phase is capabilities: can the agent do useful work? The second phase is reliability: does the agent keep doing useful work when things go wrong? The third phase is autonomy: does the system handle its own maintenance without human intervention?

Self-healing infrastructure is the bridge between reliability and autonomy. A system that restarts crashed processes is reliable. A system that submits sitemaps on deploy and catches missed submissions with a daily cron — without anyone configuring, triggering, or monitoring either path — is autonomous.

Both of the systems we shipped this cycle share three properties:

Invisible when working. In most cases an agent never notices that its MCP server crashed and recovered in 3 seconds. No human checks whether the sitemap was submitted after deploy. The systems send no notifications and need no dashboards when they are functioning correctly; restarts are recorded in the log file and nowhere else. This is by design. Infrastructure that demands attention when it is working is infrastructure that taxes the humans and agents it is supposed to serve.

Loud when broken. Structured logs, failure counters, and extended backoff thresholds make it obvious when self-healing is not enough. Ten consecutive MCP crashes in the log file is a clear signal that something structural needs human investigation. A Search Console MCP query showing zero indexed pages after a week of automated submissions means the submission pipeline has a bug. The systems are silent in success and explicit in failure.

Layered redundancy. The sitemap system has two independent submission paths. The MCP restart system has escalating backoff tiers. Neither relies on a single mechanism. This is not over-engineering — it is acknowledging that any single mechanism will eventually fail, and the cost of that failure (invisible SEO degradation, hours of lost agent work) is high enough to justify a second path.
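The "loud when broken" property can be checked mechanically. The sketch below scans log lines for the restart counter and reports whether any crash streak reached the 10-failure threshold. It assumes the log lines contain the phrase "Restart attempt N." as in the sample log line earlier in this post; the exact production format, including timestamps, is not shown here, so the regular expression is an assumption and the function name is hypothetical.

```python
import re

# Assumed format, based on the sample line: "... Restart attempt 3. Backoff: 8s."
RESTART_RE = re.compile(r"Restart attempt (\d+)\.")


def needs_human(log_lines, threshold=10):
    """Return True if any restart streak in the log reached `threshold` attempts."""
    worst = 0
    for line in log_lines:
        match = RESTART_RE.search(line)
        if match:
            worst = max(worst, int(match.group(1)))
    return worst >= threshold
```

A check like this could run against /agent-data/logs/mcp_server.log on a schedule, so the only time a human hears about MCP restarts is when self-healing has stopped being enough.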

The Broader Pattern

Every production system we have built at agent.ceo follows this trajectory. We build the feature. We run it. We observe how it fails. Then we build the self-healing layer.

The memory governor self-heals memory pressure before the OOM-killer fires. Pull-based task discovery self-heals missed NATS messages by reconstructing workloads from durable state. SLA enforcement self-heals dropped tasks by detecting silence and reassigning work. SEO automation and MCP crash recovery are the latest additions to this pattern.

The goal is an AI organization that runs without ongoing human maintenance. Not without human oversight — that is a different question with different trade-offs. But without the daily restarts, the manual submissions, the "can someone check why the marketing agent stopped responding?" messages. Those should be automated first, because they prevent everything else from running.

Try It

Self-healing infrastructure is what separates an AI experiment from a cyborgenic organization that runs without ongoing human maintenance. My fleet of 11 agents has been running for months. The memory watchdog has caught dozens of OOM-bound processes before they crashed. The crash-recovery loop has restarted MCP servers hundreds of times without losing a single agent session. And I sleep through the night.

That is what 9,799 commits, 646 of them in May alone, buy you: not features, but reliability.


I'm Moshe Beeri. I build agent.ceo -- a cyborgenic organization where AI agents and humans ship software together. 9,799 commits and counting.
