
Enterprise Knowledge Ingestion: 5,000 ERP Pages Into a Knowledge Graph in One Command

Moshe Beeri, Founder
Tags: knowledge-base, enterprise, erp, ingestion, neo4j, mcp, cyborgenic-organization

Enterprise software documentation is a graveyard of PDFs, wiki pages, and HTML exports. Thousands of pages describing modules, configurations, workflows, and edge cases — authored over years by dozens of people with inconsistent formatting and overlapping content.

AI agents need to work with this documentation. Not skim it. Not keyword-search it. Navigate it the way an experienced consultant would: understanding which modules depend on which, what configuration changes cascade where, and which workflow steps connect across subsystems.

We built the ingestion pipeline to make this possible.

The Problem With Document Dumps

The standard approach to "AI + enterprise docs" is: chunk the documents, embed the chunks, dump them in a vector store, and wire up RAG.

This works for question-answering. "What is the default timeout for batch processing?" — vector search finds the right chunk, the LLM reads it, done.

It fails for the questions that actually matter:

  • "What breaks if we change the accounting module's reconciliation schedule?"
  • "Which workflows depend on the inventory valuation method?"
  • "What's the full data flow from purchase order to general ledger?"

These questions require understanding relationships between concepts across hundreds of pages. Cosine similarity doesn't capture "module A's output feeds module B's input." Graph traversal does.

One Command, 5,000+ Pages

Our knowledge base ingestion pipeline handles enterprise documentation as a first-class use case. Point it at a documentation source — a git repository, a URL, a cloud storage bucket — and it:

  1. Crawls and parses every page, extracting content and metadata
  2. Generates vector embeddings and indexes them with HNSW for semantic search
  3. Extracts entities — module names, configuration parameters, workflow steps, API endpoints
  4. Creates typed relationships between entities based on content analysis
  5. Stores everything in the Neo4j knowledge graph with org-scoped isolation
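The extraction and storage steps above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the names `Page`, `Graph`, `KNOWN_MODULES`, `extract_entities`, `extract_relationships`, and `ingest` are hypothetical, the "X feeds Y" rule stands in for real content analysis, an in-memory set stands in for Neo4j, and the embedding step is omitted.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    text: str


@dataclass
class Graph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, relation, target)


# Hypothetical vocabulary; a real pipeline would use an extraction model.
KNOWN_MODULES = {"procurement", "inventory", "manufacturing"}


def extract_entities(page: Page) -> set:
    """Return the known module names mentioned on the page."""
    words = set(re.findall(r"[a-z]+", page.text.lower()))
    return words & KNOWN_MODULES


def extract_relationships(page: Page) -> set:
    """Toy rule: the phrase '<a> feeds <b>' becomes (a, 'FEEDS', b)."""
    pairs = re.findall(r"\b([a-z]+) feeds ([a-z]+)\b", page.text.lower())
    return {
        (a, "FEEDS", b)
        for a, b in pairs
        if a in KNOWN_MODULES and b in KNOWN_MODULES
    }


def ingest(pages, org_id: str) -> Graph:
    """Build a graph whose node keys carry the org id (org-scoped isolation)."""
    graph = Graph()
    for page in pages:
        for entity in extract_entities(page):
            graph.nodes.add((org_id, entity))
        for source, relation, target in extract_relationships(page):
            graph.edges.add(((org_id, source), relation, (org_id, target)))
    return graph
```

Prefixing every node key with the org id is one simple way to keep two organizations' graphs from ever sharing a node, even when they document modules with the same name.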

For a recent enterprise ERP documentation set, the pipeline processed over 5,000 pages into the graph. The result isn't a flat collection of searchable documents. It's a navigable knowledge graph where every module, workflow, and configuration is a node with typed edges connecting them.

What Agents Can Do With a Knowledge Graph

Once the documentation is in the graph, agents don't just search it — they traverse it.

Dependency mapping. An agent asked to assess the impact of changing a configuration parameter can traverse DEPENDS_ON and CONFIGURED_BY edges to find every module, workflow, and report affected. No keyword guessing required.
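As a rough sketch of that traversal, the function below walks DEPENDS_ON and CONFIGURED_BY edges backwards from a changed parameter to collect everything downstream of it. The edge list and the node names are invented for illustration, and a plain Python list stands in for the graph database.

```python
from collections import defaultdict, deque

# Hypothetical edges: (source, relation, target) reads "source RELATION target".
EDGES = [
    ("reconciliation_job", "CONFIGURED_BY", "reconciliation_schedule"),
    ("month_end_close", "DEPENDS_ON", "reconciliation_job"),
    ("ledger_report", "DEPENDS_ON", "month_end_close"),
    ("inventory_count", "DEPENDS_ON", "warehouse_calendar"),
]


def impacted_by(changed, edges, relations=("DEPENDS_ON", "CONFIGURED_BY")):
    """Return every node that transitively depends on `changed`."""
    # Invert the edges so we can walk from a node to its dependents.
    dependents = defaultdict(set)
    for source, relation, target in edges:
        if relation in relations:
            dependents[target].add(source)

    seen = set()
    queue = deque([changed])
    while queue:  # breadth-first search over the inverted edges
        node = queue.popleft()
        for dependent in dependents[node]:
            if dependent not in seen and dependent != changed:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

Calling `impacted_by("reconciliation_schedule", EDGES)` returns the job, the month-end close, and the ledger report, while the unrelated inventory count is left out.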

Cross-module analysis. Enterprise ERP systems have deep interdependencies between modules — procurement feeds inventory, inventory feeds manufacturing, manufacturing feeds cost accounting. The graph captures these chains explicitly, so agents can reason about cross-module impacts.
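To make the chain concrete, here is a small sketch that recovers the path between two modules. The `FEEDS` adjacency map mirrors the example chain in the paragraph above; the dictionary layout and the `feed_chain` helper are assumptions made for illustration.

```python
from collections import deque

# Adjacency map: module -> modules it feeds (taken from the example chain).
FEEDS = {
    "procurement": ["inventory"],
    "inventory": ["manufacturing"],
    "manufacturing": ["cost_accounting"],
    "cost_accounting": [],
}


def feed_chain(start, goal, feeds):
    """Return the shortest chain of modules from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for downstream in feeds.get(node, []):
            if downstream not in parents:
                parents[downstream] = node
                queue.append(downstream)
    return None
```

Because the edges are directed, asking for the chain from cost accounting back to procurement returns `None`, which is itself useful: it tells the agent that a change in cost accounting does not flow upstream.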

Onboarding acceleration. A new agent (or a new human consultant) pointed at the knowledge graph can explore the system's architecture by traversal. "Show me everything connected to the production planning module" returns a subgraph of related configurations, workflows, dependencies, and known issues — not a search results page.

Freshness and gap detection. The knowledge base tracks when pages were last updated and flags stale content. When an agent finds a configuration guide that hasn't been updated since the last major version, it flags the gap instead of confidently citing outdated information.
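One plausible form of the staleness check is a date comparison against the last major release. The sketch below assumes each page record carries a `last_updated` date; the `PageRecord` type, the `stale_pages` helper, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageRecord:
    title: str
    last_updated: date


def stale_pages(pages, last_major_release: date):
    """Return titles of pages not updated since the last major release."""
    return [p.title for p in pages if p.last_updated < last_major_release]


# Sample data, invented for illustration.
pages = [
    PageRecord("Batch processing configuration", date(2022, 3, 1)),
    PageRecord("Inventory valuation guide", date(2024, 6, 15)),
]
flagged = stale_pages(pages, last_major_release=date(2023, 1, 1))
```

An agent that finds a page title in `flagged` can attach a warning to its answer rather than citing the page as current.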

The Architecture

The ingestion pipeline runs as part of the Agent.ceo knowledge base infrastructure. Multi-tenant isolation means each organization's ingested documentation stays in its own namespace: no cross-org data leakage, no shared embeddings, no commingled graphs.

Access control applies at every level: org scope, space permissions, and per-page restrictions. An agent with access to the "engineering" space sees technical documentation. An agent with access to the "finance" space sees accounting workflows. Neither sees the other's content unless explicitly shared.
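The three levels can be read as a sequence of checks that must all pass. The following sketch shows one way to express that; the `Agent` and `Doc` types, the `restricted_to` field, and the rule that an empty restriction set means "no per-page restriction" are assumptions, not the platform's actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    org: str
    spaces: frozenset


@dataclass(frozen=True)
class Doc:
    title: str
    org: str
    space: str
    restricted_to: frozenset = frozenset()  # empty: no per-page restriction


def can_read(agent: Agent, doc: Doc) -> bool:
    """Apply org scope, then space permission, then per-page restriction."""
    if agent.org != doc.org:
        return False
    if doc.space not in agent.spaces:
        return False
    if doc.restricted_to and agent.name not in doc.restricted_to:
        return False
    return True


# Sample agents and documents, invented for illustration.
eng_agent = Agent("builder", "acme", frozenset({"engineering"}))
fin_agent = Agent("ledger", "acme", frozenset({"finance"}))
api_doc = Doc("API endpoints", "acme", "engineering")
close_doc = Doc("Month-end close", "acme", "finance", frozenset({"ledger"}))
```

Checking the org first means a mismatch there short-circuits everything else, so a misconfigured space permission can never expose another organization's content.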

Why This Matters for Enterprise AI

The gap between "AI chatbot that answers questions about docs" and "AI agent that understands a system" is the knowledge graph. Embeddings capture what a document says. The graph captures how concepts relate.

For enterprise software with thousands of interconnected modules, configurations, and workflows, that difference is the difference between a search engine and a consultant.

The Agent-Native Knowledge Base on Agent.ceo handles both: vector search for discovery, graph traversal for understanding. Ingest your documentation. Let your agents navigate it.
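A toy version of that two-step pattern looks like this: find the closest entity by cosine similarity, then pull its graph neighbors. The three-dimensional vectors, the entity names, and the `discover_then_expand` helper are made up for illustration, and a brute-force scan stands in for an HNSW index.

```python
import math

# Toy embeddings and graph neighbors; all values are invented.
EMBEDDINGS = {
    "inventory_valuation": [0.9, 0.1, 0.0],
    "purchase_orders": [0.1, 0.9, 0.0],
    "payroll": [0.0, 0.1, 0.9],
}
NEIGHBORS = {
    "inventory_valuation": ["cost_accounting", "stock_revaluation_workflow"],
    "purchase_orders": ["goods_receipt"],
    "payroll": [],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def discover_then_expand(query_vec, embeddings, neighbors):
    """Step 1: nearest entity by cosine. Step 2: its direct graph neighbors."""
    best = max(embeddings, key=lambda name: cosine(query_vec, embeddings[name]))
    return best, neighbors.get(best, [])
```

The similarity step answers "which entity is this question about?", and the neighbor lookup answers "what is connected to it?", which is the part a vector store alone cannot supply.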

Interested in ingesting your enterprise documentation? Reach out at agent.ceo.