Document Ingestion Explained

How Documents Become Searchable

When documents are added to the knowledge base, they go through a multi-stage ingestion pipeline that extracts content, classifies the document type, and splits it into searchable chunks. This process ensures that information can be found and retrieved accurately when users ask questions.

The Ingestion Pipeline

Load → Extract → Classify → Chunk → Embed → Index
  1. Load: Document file is read from storage (GCS bucket)
  2. Extract: Text, images, and tables are extracted from the document
  3. Classify: AI determines the document type (memo, research, FAQ, etc.)
  4. Chunk: Content is split using the optimal strategy for that document type
  5. Embed: Each chunk is converted to a 2048-dimensional vector
  6. Index: Vectors are added to Vertex AI Vector Search for retrieval
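The six stages above can be sketched as a single function. This is an illustrative toy, not the actual implementation: the real pipeline reads from a GCS bucket, classifies with Gemini, and produces 2048-dimensional embeddings, all of which are stubbed out here.

```python
def ingest(path, storage, index):
    """Toy sketch of the six-stage pipeline; names and logic are illustrative."""
    raw = storage[path]                               # 1. Load (stands in for a GCS read)
    text = raw.strip()                                # 2. Extract (real system also pulls images/tables)
    doc_type = "faq" if "Q:" in text else "memo"      # 3. Classify (real system uses Gemini 3 Pro)
    chunks = [c for c in text.split("\n\n") if c]     # 4. Chunk (strategy depends on doc_type)
    vectors = [[float(len(c)), 0.0] for c in chunks]  # 5. Embed (toy 2-dim; real vectors are 2048-dim)
    for chunk, vec in zip(chunks, vectors):           # 6. Index (real target is Vertex AI Vector Search)
        index.append({"doc": path, "type": doc_type, "chunk": chunk, "vector": vec})
    return index
```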

Supported File Formats

PDF Documents

Primary format for research reports, prospectuses, and formal documents. Supports text extraction, embedded images, and table detection.

Extensions: .pdf

Word Documents

Microsoft Word documents with full formatting support. Preserves headings, lists, and embedded content.

Extensions: .docx

Spreadsheets

Excel spreadsheets with tabular data. Each sheet is processed separately and tables are converted to searchable format.

Extensions: .xlsx

Markdown

Plain text with Markdown formatting. Ideal for documentation, notes, and structured text content.

Extensions: .md, .markdown
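A loader typically routes files by extension to the right extractor. The routing table below mirrors the supported formats listed above; the function and format names are hypothetical.

```python
from pathlib import Path

# Extension-to-format routing table (names are illustrative, not the real API).
SUPPORTED = {
    ".pdf": "pdf",
    ".docx": "word",
    ".xlsx": "spreadsheet",
    ".md": "markdown",
    ".markdown": "markdown",
}

def detect_format(filename: str) -> str:
    """Return the format key for a filename, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {ext}")
    return SUPPORTED[ext]
```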

Document Classification

The system uses Gemini 3 Pro to intelligently classify documents into categories. This classification determines which chunking strategy is used, optimizing how the content is split for retrieval.

Classification Categories

Research Report: Long-form analysis with sections, charts, data
Memo: Internal communications, updates, commentary
FAQ: Question-answer pairs, knowledge base articles
Product Document: Prospectuses, fact sheets, fund documentation
Presentation: Slide decks with bullet points and visuals
Data/Table: Spreadsheets, structured data
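In outline, the classifier maps raw text to one of these category labels. The sketch below uses a toy keyword heuristic purely as a stand-in for the Gemini 3 Pro call; the category keys follow the list above, but their exact identifiers in the real system are an assumption.

```python
CATEGORIES = [
    "research_report", "memo", "faq",
    "product_document", "presentation", "data_table",
]

def classify(text: str) -> str:
    """Toy keyword heuristic standing in for the Gemini 3 Pro classification call."""
    lowered = text.lower()
    if "q:" in lowered and "a:" in lowered:
        return "faq"
    if "prospectus" in lowered or "fact sheet" in lowered:
        return "product_document"
    return "memo"  # fallback; the real model distinguishes all six categories
```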

Chunking Strategies

Different document types require different chunking approaches. The system automatically selects the best strategy based on classification.

Semantic Chunking

Splits content at natural semantic boundaries (paragraphs, sections). Each chunk contains a coherent unit of meaning.

Used for: Research reports, memos, long-form content
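A minimal sketch of semantic chunking, assuming blank lines mark paragraph boundaries: short paragraphs are merged until a size budget is reached, so each chunk stays a coherent unit. The `max_chars` budget is an illustrative parameter, not the system's actual setting.

```python
def semantic_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Split at paragraph boundaries, merging neighbors under a size budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```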

Atomic Chunking

Preserves question-answer pairs as single units. Ensures the answer stays with its question for precise retrieval.

Used for: FAQs, Q&A documents, knowledge base articles
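Atomic chunking can be sketched as a split that only breaks *between* pairs, never between a question and its answer. This assumes a `Q:`/`A:` layout for illustration; real FAQ formats vary.

```python
import re

def atomic_chunks(text: str) -> list[str]:
    """Keep each Q/A pair together as one chunk (assumes 'Q:' starts a pair)."""
    # Split only at newlines that are immediately followed by a new question.
    pairs = re.split(r"\n(?=Q:)", text.strip())
    return [p.strip() for p in pairs if p.strip()]
```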

Table Summarization

Tables are converted to Markdown format with an AI-generated summary describing what the table contains.

Used for: Spreadsheets, embedded tables, data documents
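The conversion side of this strategy is straightforward to sketch. Below, `table_to_markdown` renders rows as a Markdown table, and `summarize` is a placeholder for the AI-generated summary (the real system asks the LLM for it).

```python
def table_to_markdown(headers, rows):
    """Render a header row plus data rows as a Markdown table."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)

def summarize(headers, rows):
    # Placeholder for the AI-generated description of the table's contents.
    return f"Table with columns {', '.join(headers)} and {len(rows)} rows."
```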

Structure-Preserving

Maintains document hierarchy (headings, subheadings) within chunks. Adds context about where in the document each chunk came from.

Used for: Presentations, structured documents
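A toy illustration of structure-preserving chunking, assuming Markdown-style `#` headings: each body chunk carries the heading path it sits under, so retrieval knows where in the document the chunk came from.

```python
def structured_chunks(lines):
    """Attach the current heading path as context to each body chunk."""
    chunks, path, body = [], [], []
    for line in lines:
        if line.startswith("#"):
            if body:  # flush the body collected under the previous heading
                chunks.append({"context": " > ".join(path), "text": "\n".join(body)})
                body = []
            level = len(line) - len(line.lstrip("#"))
            path = path[:level - 1] + [line.lstrip("# ").strip()]
        elif line.strip():
            body.append(line)
    if body:
        chunks.append({"context": " > ".join(path), "text": "\n".join(body)})
    return chunks
```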

Multimodal Processing

The system processes not just text, but also visual content. This enables the Knowledge Assistant to answer questions about charts, graphs, and images.

Image Extraction

Images are extracted from PDFs and documents, stored in GCS, and made available for the LLM to “see” during response generation via multimodal image passing.

Table Extraction

Tables are detected, extracted, and converted to Markdown format for accurate text-based retrieval. Numerical data is preserved with proper formatting.

Smart Caching & Incremental Processing

The system uses MD5 hashing to track which documents have changed. On re-ingestion:

  • Unchanged files: Skipped entirely (15x speedup)
  • Modified files: Re-processed with new content
  • New files: Fully processed and indexed
  • Deleted files: Removed from the index
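The four cases above amount to a diff between the stored hash manifest and the current files. A minimal sketch, assuming the manifest is a dict of filename to MD5 hex digest (the real storage layout is not specified here):

```python
import hashlib

def diff_manifest(old: dict, files: dict) -> dict:
    """Compare stored MD5 digests against current file bytes and bucket
    each file as skip / modified / added; anything missing is deleted."""
    new_hashes = {name: hashlib.md5(data).hexdigest() for name, data in files.items()}
    return {
        "skip":     [n for n in files if old.get(n) == new_hashes[n]],   # unchanged
        "modified": [n for n in files if n in old and old[n] != new_hashes[n]],
        "added":    [n for n in files if n not in old],
        "deleted":  [n for n in old if n not in files],
        "hashes":   new_hashes,  # becomes the manifest for the next run
    }
```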
