Stage Timing & Performance

The Query Pipeline Stages

When you ask a question, it passes through several processing stages before a response is generated. Understanding these stages helps explain why some queries take longer than others and how to optimize for speed.

Assess → Analyze → Search → Generate

Assess Stage

Query Complexity Analysis

The system analyzes your query to determine its complexity (simple, medium, complex) and adjusts retrieval parameters accordingly. Simple queries fetch fewer candidates for faster responses.

Typical duration: 50-200ms
What happens: LLM classifies query complexity

Complexity Levels

  • Simple: Direct factual questions (“What is BITB's expense ratio?”)
  • Medium: Questions requiring synthesis (“How do Bitcoin ETFs work?”)
  • Complex: Multi-part or analytical questions (“Compare all crypto ETFs”)
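The sketch below illustrates how a complexity classification might map to retrieval parameters. The classify_complexity() heuristic and the specific top_k/rerank values are illustrative assumptions, not the production classifier or defaults.

```python
# Minimal sketch of the Assess stage: classify a query's complexity and pick
# retrieval parameters. In the real pipeline an LLM does the classification;
# this heuristic and the parameter values are placeholders.

RETRIEVAL_PARAMS = {
    "simple":  {"top_k": 5,  "rerank": False},
    "medium":  {"top_k": 10, "rerank": True},
    "complex": {"top_k": 20, "rerank": True},
}

def classify_complexity(query: str) -> str:
    """Stand-in for the LLM classifier: a crude keyword/length heuristic."""
    if any(word in query.lower() for word in ("compare", "versus", "all")):
        return "complex"
    if len(query.split()) > 12:
        return "medium"
    return "simple"

def assess(query: str) -> dict:
    level = classify_complexity(query)
    return {"complexity": level, **RETRIEVAL_PARAMS[level]}

print(assess("What is BITB's expense ratio?"))
# -> {'complexity': 'simple', 'top_k': 5, 'rerank': False}
```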

Analyze Stage

Intent Detection

The query is analyzed to detect intent (factual, comparison, temporal, exploratory) and extract entities (products, tickers, dates). This determines relevance thresholds and filter settings.

Typical duration: 100-300ms
What happens: Entity extraction, intent classification, date parsing

Detected Intent Types

  • Factual: Seeking specific facts or numbers
  • Comparison: Comparing multiple items
  • Temporal: Time-based or historical questions
  • Exploratory: Broad learning or overview questions
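As a rough illustration of what this stage produces, the sketch below detects intent and extracts entities with simple pattern matching. The real pipeline uses an LLM and proper entity extraction; the keyword lists and regexes here are assumptions.

```python
# Illustrative sketch of the Analyze stage: intent detection plus entity and
# date extraction. Keyword lists and regex patterns are placeholders.

import re

def analyze(query: str) -> dict:
    q = query.lower()
    if any(w in q for w in ("compare", " vs ", "versus")):
        intent = "comparison"
    elif any(w in q for w in ("when", "since", "history", "over time")):
        intent = "temporal"
    elif any(w in q for w in ("overview", "explain", "how do", "how does")):
        intent = "exploratory"
    else:
        intent = "factual"

    tickers = re.findall(r"\b[A-Z]{3,5}\b", query)       # e.g. BITB
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", query)  # ISO dates
    return {"intent": intent, "tickers": tickers, "dates": dates}

print(analyze("Compare BITB and IBIT expense ratios"))
# -> {'intent': 'comparison', 'tickers': ['BITB', 'IBIT'], 'dates': []}
```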

Retrieval & Filtering

The query is converted to an embedding vector and searched against the knowledge base. Results are filtered by relevance, length, and entity overlap, then optionally reranked.

Typical duration: 200-800ms
What happens: Vector search, filtering, reranking (optional)

Search Sub-stages

  • Embed: Query converted to 2048-dimensional vector (~50ms)
  • Vector Search: Vertex AI finds similar chunks (~100-200ms)
  • BM25 Search: Keyword matching if hybrid enabled (~50ms)
  • Filter: Apply relevance and entity filters (~10ms)
  • Rerank: Vertex AI reordering if enabled (~200-500ms)
  • Cross-Encoder: Additional reranking if enabled (~200-400ms)
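The sketch below focuses on the Filter sub-stage: dropping low-relevance chunks, enforcing entity overlap, and deduplicating merged vector/BM25 candidates. The candidate record shape and the 0.6 score threshold are illustrative assumptions.

```python
# Minimal sketch of the Filter sub-stage: keep chunks above a relevance
# threshold that overlap the query's entities, then deduplicate merged
# vector + BM25 results by chunk id, keeping the highest-scoring copy.

def filter_candidates(candidates: list[dict], entities: list[str],
                      min_score: float = 0.6, top_k: int = 10) -> list[dict]:
    kept = [
        c for c in candidates
        if c["score"] >= min_score
        and (not entities or set(entities) & set(c.get("entities", [])))
    ]
    best: dict[str, dict] = {}
    for c in sorted(kept, key=lambda c: c["score"], reverse=True):
        best.setdefault(c["id"], c)      # first (highest-scoring) copy wins
    return list(best.values())[:top_k]

candidates = [
    {"id": "a", "score": 0.82, "entities": ["BITB"]},
    {"id": "a", "score": 0.71, "entities": ["BITB"]},   # BM25 duplicate
    {"id": "b", "score": 0.55, "entities": ["BITB"]},   # below threshold
    {"id": "c", "score": 0.78, "entities": ["ETHE"]},   # no entity overlap
]
print(filter_candidates(candidates, entities=["BITB"]))
# -> [{'id': 'a', 'score': 0.82, 'entities': ['BITB']}]
```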

Generate Stage

Response Generation

Retrieved chunks are passed to the LLM along with your query and conversation history. The model generates a response that streams back to you in real-time.

Time to first token: 500ms-3s (varies by model)
Streaming rate: 50-150 chars/second
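The sketch below shows the general shape of this stage: assemble a prompt from the retrieved chunks and conversation history, then measure time to first token while consuming a streamed response. The stream_llm() function is a simulated stand-in for whichever streaming model client the deployment actually uses.

```python
# Minimal sketch of the Generate stage with a simulated token stream.

import time

def build_prompt(query: str, chunks: list[str], history: list[str]) -> str:
    context = "\n\n".join(chunks)
    convo = "\n".join(history)
    return f"Context:\n{context}\n\nConversation:\n{convo}\n\nQuestion: {query}"

def stream_llm(prompt: str):
    """Simulated token stream; replace with the real model client call."""
    for token in ["Bitcoin ", "ETFs ", "hold ", "spot ", "BTC."]:
        time.sleep(0.05)
        yield token

def generate(query: str, chunks: list[str], history: list[str]) -> str:
    prompt = build_prompt(query, chunks, history)
    start = time.monotonic()
    first_token_at = None
    parts = []
    for token in stream_llm(prompt):
        if first_token_at is None:
            first_token_at = time.monotonic() - start   # time to first token
        parts.append(token)
    print(f"TTFT: {first_token_at:.3f}s")
    return "".join(parts)
```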

Why Some Queries Are Slow

Common Causes of Slow Responses

  • Complex queries: More candidates retrieved and processed
  • Reranking enabled: Adds 200-500ms per reranking stage
  • HyDE generation: Adds ~200ms for hypothetical document
  • Large context: More chunks mean longer LLM processing
  • Pro models: Higher quality but slower than Flash/Haiku
  • Live data: MCP API calls add network latency

Optimizing for Speed

For Fastest Responses

  • Use Turbo experiment variant (disables HyDE and reranking)
  • Use Gemini Flash or Claude Haiku
  • Ask simple, focused questions
  • Disable hybrid search for pure semantic queries
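Taken together, a "fastest response" configuration might look like the sketch below. The key names mirror the options discussed in this article but are illustrative; check your deployment's actual configuration schema before using them.

```python
# Hypothetical low-latency settings combining the tips above.
FAST_CONFIG = {
    "experiment_variant": "turbo",   # disables HyDE and reranking
    "model": "gemini-flash",         # or "claude-haiku"
    "hybrid_search": False,          # pure semantic search
    "use_hyde": False,
    "use_reranker": False,
    "top_k": 5,                      # fewer candidates, faster search
}
```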

Speed vs. Quality Trade-offs

Setting             | Speed Impact       | Quality Impact
Turbo variant       | Much faster        | Lower quality
Disable reranking   | 200-500ms faster   | Slightly lower
Disable HyDE        | ~200ms faster      | Worse for vague queries
Flash/Haiku models  | ~1s faster TTFT    | Good but not best
Lower top_k         | Faster search      | May miss relevant content

Monitoring Performance

The Analytics Dashboard provides detailed stage-timing breakdowns, including:

  • Average, P50, and P95 timings for each stage
  • Comparisons across experiment variants
  • Trends over time
  • Breakdown by query complexity
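If you are exporting raw stage durations yourself, average, P50, and P95 figures like those on the dashboard can be computed with the standard library, as in this sketch (the sample timings are made up for illustration):

```python
# Compute average, P50, and P95 per stage from raw durations in milliseconds.

from statistics import mean, quantiles

stage_timings_ms = {
    "assess":   [60, 85, 120, 95, 180],
    "analyze":  [110, 150, 240, 190, 300],
    "search":   [250, 400, 780, 520, 610],
    "generate": [900, 1500, 2800, 1200, 2100],
}

for stage, samples in stage_timings_ms.items():
    pcts = quantiles(samples, n=100)   # pcts[49] ~ P50, pcts[94] ~ P95
    print(f"{stage:>8}: avg={mean(samples):.0f}ms "
          f"p50={pcts[49]:.0f}ms p95={pcts[94]:.0f}ms")
```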
