When documents are added to the knowledge base, they go through a multi-stage ingestion pipeline that extracts content, classifies the document type, and splits it into searchable chunks. This process ensures that information can be found and retrieved accurately when users ask questions.
PDF documents, the primary format for research reports, prospectuses, and formal documents. Supports text extraction, embedded images, and table detection.
Microsoft Word documents with full formatting support. Preserves headings, lists, and embedded content.
Excel spreadsheets with tabular data. Each sheet is processed separately and tables are converted to searchable format.
Plain text with Markdown formatting. Ideal for documentation, notes, and structured text content.
The system uses Gemini 3 Pro to classify each document into a category. The category determines which chunking strategy is applied, and therefore how the content is split for retrieval.
Different document types require different chunking approaches. The system automatically selects the best strategy based on classification.
Splits content at natural semantic boundaries (paragraphs, sections). Each chunk contains a coherent unit of meaning.
Preserves question-answer pairs as single units. Ensures the answer stays with its question for precise retrieval.
Tables are converted to Markdown format with an AI-generated summary describing what the table contains.
Maintains document hierarchy (headings, subheadings) within chunks. Adds context about where in the document each chunk came from.
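As a rough illustration of how classification could drive chunking, the sketch below implements simplified versions of three of the strategies described above and a dispatcher that picks one by category. It is not the system's actual implementation: the function names, the category labels (`"faq"`, `"structured"`), the `Q:`/`A:` line convention, the Markdown-style `#` headings, and the `max_chars` budget are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List


def semantic_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Group whole paragraphs (separated by blank lines) into chunks."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not para:
            continue
        # Start a new chunk when this paragraph would exceed the budget.
        # A single oversized paragraph is kept whole, never cut mid-paragraph.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def qa_chunks(text: str) -> List[str]:
    """Keep each question and its answer together as one chunk."""
    # Split just before every line that starts with "Q:" (assumed convention).
    parts = re.split(r"(?m)^(?=Q:)", text)
    return [p.strip() for p in parts if p.strip().startswith("Q:")]


def hierarchical_chunks(text: str) -> List[str]:
    """Prefix each section body with its heading path, e.g. [Guide > Setup]."""
    path: List[str] = []
    body: List[str] = []
    chunks: List[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append(f"[{' > '.join(path)}]\n{content}")
        body.clear()

    for line in text.splitlines():
        match = re.match(r"(#{1,6})\s+(.*)", line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical category labels; the real classifier's label set may differ.
STRATEGIES: Dict[str, Callable[[str], List[str]]] = {
    "faq": qa_chunks,
    "structured": hierarchical_chunks,
}


def chunk_document(text: str, category: str) -> List[str]:
    """Pick a strategy by category, falling back to semantic chunking."""
    return STRATEGIES.get(category, semantic_chunks)(text)
```

For instance, `chunk_document("Q: a?\nA: b", "faq")` keeps the pair in a single chunk, while an unrecognized category falls back to paragraph-based splitting. The table strategy is omitted here because it depends on an AI-generated summary.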
The system processes not just text, but also visual content. This enables the Knowledge Assistant to answer questions about charts, graphs, and images.
Images are extracted from PDFs and other documents, stored in Google Cloud Storage (GCS), and passed to the LLM as multimodal input so that it can “see” them during response generation.
Tables are detected, extracted, and converted to Markdown format for accurate text-based retrieval. Numerical data is preserved with proper formatting.
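The Markdown conversion step can be pictured with a minimal sketch like the one below. It assumes the table has already been extracted into a header and a list of rows; the function name `table_to_markdown` is hypothetical, and the AI-generated summary mentioned earlier is outside the scope of this example.

```python
from typing import List, Sequence


def table_to_markdown(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render a header and rows as a pipe-delimited Markdown table."""

    def fmt(cell: object) -> str:
        # Escape pipes so cell text cannot break the column layout.
        return str(cell).replace("|", "\\|")

    lines: List[str] = [
        "| " + " | ".join(fmt(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(fmt(c) for c in row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative data only.
    print(table_to_markdown(["Fund", "Return (%)"], [["A", 1.5], ["B", 2.25]]))
```

Cells are converted with `str()`, so numeric values keep whatever representation they had at extraction time; any locale-specific or fixed-decimal formatting would need to be applied before this step.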
The system uses MD5 hashing to track which documents have changed. On re-ingestion: