Self-Healing Infrastructure: The Invisible Systems That Keep AI Agents Running
TL;DR
- Automated SEO sitemap submission (GitHub Actions + K8s CronJob fallback) eliminates silent indexing failures.
- MCP crash recovery with exponential backoff keeps agents' tools alive through transient failures -- about 50 lines of shell script separating a 3-second blip from hours of lost work.
- Self-healing is what turns an AI demo into a cyborgenic organization that actually runs.
The best infrastructure is invisible. DNS resolution. TLS certificate renewal. Log rotation. These systems run continuously, fail occasionally, recover automatically, and only surface when the self-healing itself breaks.
A cyborgenic organization -- one where AI agents operate as persistent, autonomous team members -- cannot rely on a human noticing that something crashed. The infrastructure must maintain itself. This cycle we shipped two systems that embody that principle: automated SEO sitemap submission and MCP server crash recovery. Neither is glamorous. Neither will show up in a demo. Both solve the same structural problem: things that work fine when a human is watching and fail silently when nobody is.
Here is what we built, how it works, and why self-healing patterns are the difference between a demo and a production system.
SEO Sitemap Automation
The problem nobody noticed
Here is a failure mode that does not trigger alerts, does not crash pods, and does not show up in any dashboard: your sitemap goes stale.
Every time we deploy new content to the agent.ceo marketing site — a blog post, a landing page, an updated product description — search engines need to know about it. The mechanism is sitemap submission: you generate an XML sitemap listing every URL on your site, then you tell Google Search Console "here is the updated map, come re-crawl."
For months, this was a manual step. Someone would remember to submit the sitemap after a deploy. Sometimes. When they did not, new content sat unindexed for days or weeks. The blog post was live. The URL worked. But Google did not know it existed, so organic search traffic to that page was zero until the next scheduled crawl happened to pick it up.
This is the kind of failure that compounds. One missed submission is invisible. Twenty missed submissions over two months means your site's search presence is perpetually stale. You cannot fix it with a one-time script because the problem recurs with every deploy.
The fix: two layers of automation
We built two independent submission paths. Either one is sufficient. Together, they make missed submissions nearly impossible.
Layer 1: GitHub Actions reusable workflow. The file is sitemap-submit.yml, a reusable workflow that any repository can call from its post-deploy step. When the marketing site deploys, the workflow triggers automatically: it takes the site's generated sitemap URL, authenticates to the Google Search Console API using the agent-deployer service account, and submits the sitemap.
The reusable workflow design matters. We did not hardcode this into one repository's CI pipeline. Any site we deploy — the main marketing site, the docs site, the blog — can call the same workflow with its own sitemap URL. One implementation, multiple consumers. When we improve the submission logic (retries, error reporting, multi-engine support), every consumer gets the improvement for free.
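The core of the workflow is a single API call. Here is a minimal sketch of that step, assuming a service-account access token is already in ACCESS_TOKEN; the endpoint is Search Console's sitemaps.submit, and the site and sitemap URLs are illustrative:

```bash
#!/usr/bin/env bash
# Sketch of the sitemap submission step. SITE_URL and SITEMAP_URL are
# illustrative; ACCESS_TOKEN is assumed to hold a service-account OAuth token.
set -euo pipefail
: "${ACCESS_TOKEN:?set to a Search Console OAuth access token}"

SITE_URL="https://agent.ceo/"
SITEMAP_URL="https://agent.ceo/sitemap.xml"

# sitemaps.submit is an empty-body PUT to
# /webmasters/v3/sites/{siteUrl}/sitemaps/{feedpath}, both URL-encoded.
encode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

curl --fail --silent --show-error -X PUT \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  "https://www.googleapis.com/webmasters/v3/sites/$(encode "$SITE_URL")/sitemaps/$(encode "$SITEMAP_URL")"

echo "Sitemap submitted: ${SITEMAP_URL}"
```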
Layer 2: Kubernetes CronJob at 06:00 UTC. The GitHub Actions workflow covers the deploy-triggered case: new content goes live, sitemap gets submitted immediately. But what about content changes that do not involve a deploy? What about the case where the Actions workflow fails silently because of a transient API error?
A K8s CronJob runs daily at 06:00 UTC as the fallback. It submits the sitemap regardless of whether a deploy happened. Same service account, same API call, same authentication path. If the Actions workflow already submitted it, the duplicate submission is harmless — Google just re-processes the same sitemap. If the Actions workflow missed it, the CronJob catches it within 24 hours.
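For illustration, the scheduling half of this can be expressed in one kubectl command; the image and script names here are hypothetical, and the real job also mounts the service-account secret described below:

```bash
# Hypothetical names. The production manifest additionally mounts the
# agent-deployer key as a secret volume for the script to consume.
kubectl create cronjob sitemap-submit \
  --image=registry.example.com/sitemap-submit:latest \
  --schedule="0 6 * * *" \
  -- /bin/sh -c "/opt/scripts/submit-sitemap.sh"
```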
Both paths authenticate through the agent-deployer service account, which has the minimum required permissions for Search Console API access. No human credentials. No OAuth tokens that expire and require manual refresh. The service account key lives in a Kubernetes secret, rotated on schedule, and both the Actions workflow and the CronJob consume it identically.
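Minting a short-lived token from that key is the one step both consumers share. A sketch, assuming the key is mounted at an illustrative path and the gcloud CLI is available in both environments:

```bash
# Mint an access token from the service-account key. The mount path is
# illustrative; both the Actions workflow and the CronJob run the
# equivalent of this before calling the Search Console API.
gcloud auth activate-service-account --key-file=/secrets/agent-deployer.json
ACCESS_TOKEN="$(gcloud auth print-access-token)"
export ACCESS_TOKEN
```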
Search Console MCP tools
Submitting sitemaps is one direction: pushing information to Google. We also built the reverse: agents can query Google Search Console to check indexing status.
The Search Console MCP tools let any agent in the fleet check which pages are indexed, which have errors, and which are pending. The marketing agent uses this during content audits. Instead of logging into Search Console manually and eyeballing the coverage report, the agent queries the API directly, identifies pages with indexing issues, and either fixes the problem (if it is a content issue) or flags it for infrastructure review (if it is a crawl error).
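Under the hood, an indexing-status check maps to Google's URL Inspection API. A hedged sketch of the kind of call such a tool makes, with an illustrative page URL and the same token handling as above:

```bash
# Check indexing status for one URL via the URL Inspection API.
# The inspected URL is illustrative; ACCESS_TOKEN as minted above.
curl --fail --silent -X POST \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"inspectionUrl": "https://agent.ceo/blog/example-post", "siteUrl": "https://agent.ceo/"}' \
  "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"
# The response's inspectionResult.indexStatusResult.coverageState reports
# whether the page is indexed, excluded, or erroring.
```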
This closes the loop. Automated submission ensures content reaches Google. MCP tools ensure agents can verify it arrived. No human in the middle for either direction.
MCP Crash Recovery
The failure mode that kills agent sessions
MCP (Model Context Protocol) servers are how our agents access tools. Every capability an agent has — reading its inbox, updating a task, querying the discovery engine, sending a message — routes through an MCP server. If the MCP server goes down, the agent loses all of its tools. Not some tools. All tools. The agent can still reason, but it cannot act. It becomes an expensive process that thinks very hard about problems it cannot solve.
Our previous implementation was a bare exec call. The wrapper script launched the MCP server process and moved on. If the process crashed — segfault, unhandled exception, OOM-kill, anything — it was gone. No restart. No recovery. The agent session would eventually fail when every tool call returned an error, and someone would have to investigate why the marketing agent had been sitting idle for three hours.
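Reconstructed for illustration, the old wrapper amounted to this:

```bash
#!/usr/bin/env bash
# The old pattern, approximately: replace the wrapper with the server
# process and hope. If it crashes, nothing brings it back.
# MCP_SERVER_CMD stands in for the real launch command.
exec "${MCP_SERVER_CMD}"
```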
This happened more often than you would expect. MCP servers are long-running processes that handle JSON-RPC over stdio. They accumulate state. They interact with external APIs that can return unexpected responses. They are not inherently fragile, but any process that runs for hours will eventually encounter a condition its author did not anticipate. The question is not whether it will crash, but what happens when it does.
The fix: auto-restart with exponential backoff
The new implementation wraps the MCP server in a restart loop with exponential backoff, failure tracking, and signal handling. Here is the design.
Restart loop. When the MCP server exits unexpectedly, the wrapper restarts it automatically. The agent session continues. From the agent's perspective, there is a brief window — typically under five seconds — where tool calls fail. Then the server is back, and tools work again. The agent retries the failed call and keeps working. In most cases, the agent does not even notice the interruption.
Exponential backoff. Naive restart loops are dangerous. If the MCP server crashes because of a persistent condition (corrupted state file, misconfigured environment variable, incompatible dependency), restarting it immediately will just crash it again. And again. And again. Hundreds of times per minute, filling logs and consuming CPU.
The backoff starts at 2 seconds and doubles on each consecutive failure, capping at 60 seconds. First crash: wait 2 seconds, restart. Second crash: wait 4 seconds. Third: 8 seconds. This gives transient issues time to resolve (a brief network partition, a momentary memory spike) while preventing runaway restart storms for persistent failures.
Failure tracking. The restart loop tracks consecutive failures using shared state in common.sh. After 10 consecutive failures without a successful startup, the system enters an extended backoff period. At that point, the issue is almost certainly not transient — something is structurally wrong, and hammering the restart loop will not fix it. The extended backoff reduces system load while preserving the ability to recover if the underlying issue resolves (for example, if a dependent service comes back online).
Signal handling. The restart loop handles SIGTERM and SIGINT for graceful shutdown. When Kubernetes sends SIGTERM to drain a pod, the wrapper catches it, forwards it to the MCP server process, waits for clean exit, and then exits itself. Without this, pod termination during an MCP restart window could leave orphan processes or corrupt state files.
Structured logging. Every restart event — the crash, the backoff duration, the restart attempt, the success or failure of the new process — logs to /agent-data/logs/mcp_server.log with timestamps and severity levels. When something does go wrong enough to require human investigation, the log tells the full story: when the crashes started, how often, what the backoff progression looked like, and whether recovery succeeded.
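Put together, the pattern fits in roughly the 50 lines mentioned above. Here is a minimal sketch under the same design, with illustrative paths, MCP_SERVER_CMD standing in for the real launch command, and the shared state from common.sh simplified to local variables:

```bash
#!/usr/bin/env bash
# Supervise an MCP server: auto-restart, exponential backoff, failure
# tracking, graceful shutdown. Paths and thresholds are illustrative.
set -u
: "${MCP_SERVER_CMD:?set to the MCP server launch command}"

LOG=/agent-data/logs/mcp_server.log
BACKOFF=2            # seconds; doubles on each consecutive failure
MAX_BACKOFF=60
FAILS=0              # consecutive failures without a successful run
MAX_FAILS=10         # threshold for extended backoff
EXTENDED_BACKOFF=300 # illustrative extended-backoff duration
CHILD=0
SHUTDOWN=0

log() { printf '%s [%s] %s\n' "$(date -u +%FT%TZ)" "$1" "$2" >> "$LOG"; }

# Graceful shutdown: forward SIGTERM/SIGINT to the server process.
on_signal() {
  SHUTDOWN=1
  log INFO "Shutdown signal received; stopping MCP server."
  [ "$CHILD" -ne 0 ] && kill -TERM "$CHILD" 2>/dev/null
}
trap on_signal TERM INT

while [ "$SHUTDOWN" -eq 0 ]; do
  start=$(date +%s)
  "$MCP_SERVER_CMD" &   # launch the server in the background
  CHILD=$!
  wait "$CHILD"; code=$?
  if [ "$SHUTDOWN" -eq 1 ]; then
    wait "$CHILD" 2>/dev/null   # let the server exit cleanly
    break
  fi
  CHILD=0

  # A run that survived over a minute counts as a successful startup.
  [ $(( $(date +%s) - start )) -gt 60 ] && { FAILS=0; BACKOFF=2; }

  FAILS=$((FAILS + 1))
  if [ "$FAILS" -ge "$MAX_FAILS" ]; then
    log ERROR "MCP server failed ${FAILS} times; extended backoff ${EXTENDED_BACKOFF}s."
    sleep "$EXTENDED_BACKOFF"
  else
    log WARN "MCP server exited with code ${code}. Restart attempt ${FAILS}. Backoff: ${BACKOFF}s."
    sleep "$BACKOFF"
    BACKOFF=$(( BACKOFF * 2 > MAX_BACKOFF ? MAX_BACKOFF : BACKOFF * 2 ))
  fi
done
log INFO "MCP supervisor exiting."
```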
What this looks like in practice
Here is a real scenario. The MCP server hits an unhandled exception while processing a malformed JSON-RPC request from a tool call. The process exits with code 1.
- The wrapper detects the exit and logs: `[WARN] MCP server exited with code 1. Restart attempt 1. Backoff: 2s.`
- It waits 2 seconds, then restarts the server.
- The server starts successfully. The agent's next tool call works. The wrapper logs: `[INFO] MCP server recovered after 1 restart.`
- Total downtime: approximately 3 seconds.
- The agent retried one tool call. It did not lose context. It did not lose its task. It kept working.
Without the restart loop, that same scenario ends with the agent losing all tool access for the remainder of its session. The session eventually times out or gets killed by SLA enforcement. The task gets reassigned. Hours of accumulated context are lost. And someone has to figure out why.
The difference between these two outcomes is about 50 lines of shell script.
Why Self-Healing Matters More Than Features
There is a pattern in how agent infrastructure matures. The first phase is capabilities: can the agent do useful work? The second phase is reliability: does the agent keep doing useful work when things go wrong? The third phase is autonomy: does the system handle its own maintenance without human intervention?
Self-healing infrastructure is the bridge between reliability and autonomy. A system that restarts crashed processes is reliable. A system that submits sitemaps on deploy and catches missed submissions with a daily cron — without anyone configuring, triggering, or monitoring either path — is autonomous.
Both of the systems we shipped this cycle share three properties:
Invisible when working. No agent knows its MCP server crashed and recovered in 3 seconds. No human checks whether the sitemap was submitted after deploy. The systems produce no output, no notifications, no dashboards when they are functioning correctly. This is by design. Infrastructure that demands attention when it is working is infrastructure that taxes the humans and agents it is supposed to serve.
Loud when broken. Structured logs, failure counters, and extended backoff thresholds make it obvious when self-healing is not enough. Ten consecutive MCP crashes in the log file is a clear signal that something structural needs human investigation. A Search Console MCP query showing zero indexed pages after a week of automated submissions means the submission pipeline has a bug. The systems are silent in success and explicit in failure.
Layered redundancy. The sitemap system has two independent submission paths. The MCP restart system has escalating backoff tiers. Neither relies on a single mechanism. This is not over-engineering — it is acknowledging that any single mechanism will eventually fail, and the cost of that failure (invisible SEO degradation, hours of lost agent work) is high enough to justify a second path.
The Broader Pattern
Every production system we have built at agent.ceo follows this trajectory. We build the feature. We run it. We observe how it fails. Then we build the self-healing layer.
The memory governor self-heals memory pressure before the OOM-killer fires. Pull-based task discovery self-heals missed NATS messages by reconstructing workloads from durable state. SLA enforcement self-heals dropped tasks by detecting silence and reassigning work. SEO automation and MCP crash recovery are the latest additions to this pattern.
The goal is an AI organization that runs without ongoing human maintenance. Not without human oversight — that is a different question with different trade-offs. But without the daily restarts, the manual submissions, the "can someone check why the marketing agent stopped responding?" messages. Those should be automated first, because they prevent everything else from running.
Try It
Self-healing infrastructure is what separates an AI experiment from a cyborgenic organization that runs without ongoing human maintenance. The fleet is running. The infrastructure maintains itself. The agents are working.
Build your own cyborgenic organization at agent.ceo.