Stage Timing & Performance

The Query Pipeline Stages

When you ask a question, it passes through several processing stages before a response is generated. Understanding these stages helps explain why some queries take longer than others and how to optimize for speed.

Assess → Analyze → Search → Generate

Assess Stage

Query Complexity Analysis

The system analyzes your query to determine its complexity (simple, medium, complex) and adjusts retrieval parameters accordingly. Simple queries fetch fewer candidates for faster responses.

Typical duration: 50-200ms
What happens: LLM classifies query complexity

Complexity Levels

  • Simple: Direct factual questions (“What is BITB's expense ratio?”)
  • Medium: Questions requiring synthesis (“How do Bitcoin ETFs work?”)
  • Complex: Multi-part or analytical questions (“Compare all crypto ETFs”)
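The sketch below illustrates how a complexity classification might map to retrieval parameters. The classify_complexity() heuristic and the specific top_k/rerank values are illustrative assumptions, not the production classifier or defaults.

```python
# Minimal sketch of the Assess stage: classify a query's complexity and pick
# retrieval parameters. In the real pipeline an LLM does the classification;
# this heuristic and the parameter values are placeholders.

RETRIEVAL_PARAMS = {
    "simple":  {"top_k": 5,  "rerank": False},
    "medium":  {"top_k": 10, "rerank": True},
    "complex": {"top_k": 20, "rerank": True},
}

def classify_complexity(query: str) -> str:
    """Stand-in for the LLM classifier: a crude keyword/length heuristic."""
    if any(word in query.lower() for word in ("compare", "versus", "all")):
        return "complex"
    if len(query.split()) > 12:
        return "medium"
    return "simple"

def assess(query: str) -> dict:
    level = classify_complexity(query)
    return {"complexity": level, **RETRIEVAL_PARAMS[level]}

print(assess("What is BITB's expense ratio?"))
# -> {'complexity': 'simple', 'top_k': 5, 'rerank': False}
```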

Analyze Stage

Intent Detection

The query is analyzed to detect intent (factual, comparison, temporal, exploratory) and extract entities (products, tickers, dates). This determines relevance thresholds and filter settings.

Typical duration: 100-300ms
What happens: Entity extraction, intent classification, date parsing

Detected Intent Types

  • Factual: Seeking specific facts or numbers
  • Comparison: Comparing multiple items
  • Temporal: Time-based or historical questions
  • Exploratory: Broad learning or overview questions
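As a rough illustration of what this stage produces, the sketch below detects intent and extracts entities with simple pattern matching. The real pipeline uses an LLM and proper entity extraction; the keyword lists and regexes here are assumptions.

```python
# Illustrative sketch of the Analyze stage: intent detection plus entity and
# date extraction. Keyword lists and regex patterns are placeholders.

import re

def analyze(query: str) -> dict:
    q = query.lower()
    if any(w in q for w in ("compare", " vs ", "versus")):
        intent = "comparison"
    elif any(w in q for w in ("when", "since", "history", "over time")):
        intent = "temporal"
    elif any(w in q for w in ("overview", "explain", "how do", "how does")):
        intent = "exploratory"
    else:
        intent = "factual"

    tickers = re.findall(r"\b[A-Z]{3,5}\b", query)       # e.g. BITB
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", query)  # ISO dates
    return {"intent": intent, "tickers": tickers, "dates": dates}

print(analyze("Compare BITB and IBIT expense ratios"))
# -> {'intent': 'comparison', 'tickers': ['BITB', 'IBIT'], 'dates': []}
```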

Retrieval & Filtering

The query is converted to an embedding vector and searched against the knowledge base. Results are filtered by relevance, length, and entity overlap, then optionally reranked.

Typical duration: 200-800ms
What happens: Vector search, filtering, reranking (optional)

Search Sub-stages

  • Embed: Query converted to 2048-dimensional vector (~50ms)
  • Vector Search: Vertex AI finds similar chunks (~100-200ms)
  • BM25 Search: Keyword matching if hybrid enabled (~50ms)
  • Filter: Apply relevance and entity filters (~10ms)
  • Rerank: Vertex AI reordering if enabled (~200-500ms)
  • Cross-Encoder: Additional reranking if enabled (~200-400ms)
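The sketch below focuses on the Filter sub-stage: dropping low-relevance chunks, enforcing entity overlap, and deduplicating merged vector/BM25 candidates. The candidate record shape and the 0.6 score threshold are illustrative assumptions.

```python
# Minimal sketch of the Filter sub-stage: keep chunks above a relevance
# threshold that overlap the query's entities, then deduplicate merged
# vector + BM25 results by chunk id, keeping the highest-scoring copy.

def filter_candidates(candidates: list[dict], entities: list[str],
                      min_score: float = 0.6, top_k: int = 10) -> list[dict]:
    kept = [
        c for c in candidates
        if c["score"] >= min_score
        and (not entities or set(entities) & set(c.get("entities", [])))
    ]
    best: dict[str, dict] = {}
    for c in sorted(kept, key=lambda c: c["score"], reverse=True):
        best.setdefault(c["id"], c)      # first (highest-scoring) copy wins
    return list(best.values())[:top_k]

candidates = [
    {"id": "a", "score": 0.82, "entities": ["BITB"]},
    {"id": "a", "score": 0.71, "entities": ["BITB"]},   # BM25 duplicate
    {"id": "b", "score": 0.55, "entities": ["BITB"]},   # below threshold
    {"id": "c", "score": 0.78, "entities": ["ETHE"]},   # no entity overlap
]
print(filter_candidates(candidates, entities=["BITB"]))
# -> [{'id': 'a', 'score': 0.82, 'entities': ['BITB']}]
```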

Generate Stage

Response Generation

Retrieved chunks are passed to the LLM along with your query and conversation history. The model generates a response that streams back to you in real-time.

Time to first token: 500ms-3s (varies by model)
Streaming rate: 50-150 chars/second
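The sketch below shows the general shape of this stage: assemble a prompt from the retrieved chunks and conversation history, then measure time to first token while consuming a streamed response. The stream_llm() function is a simulated stand-in for whichever streaming model client the deployment actually uses.

```python
# Minimal sketch of the Generate stage with a simulated token stream.

import time

def build_prompt(query: str, chunks: list[str], history: list[str]) -> str:
    context = "\n\n".join(chunks)
    convo = "\n".join(history)
    return f"Context:\n{context}\n\nConversation:\n{convo}\n\nQuestion: {query}"

def stream_llm(prompt: str):
    """Simulated token stream; replace with the real model client call."""
    for token in ["Bitcoin ", "ETFs ", "hold ", "spot ", "BTC."]:
        time.sleep(0.05)
        yield token

def generate(query: str, chunks: list[str], history: list[str]) -> str:
    prompt = build_prompt(query, chunks, history)
    start = time.monotonic()
    first_token_at = None
    parts = []
    for token in stream_llm(prompt):
        if first_token_at is None:
            first_token_at = time.monotonic() - start   # time to first token
        parts.append(token)
    print(f"TTFT: {first_token_at:.3f}s")
    return "".join(parts)
```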

Why Some Queries Are Slow

Common Causes of Slow Responses

  • Complex queries: More candidates retrieved and processed
  • Reranking enabled: Adds 200-500ms per reranking stage
  • HyDE generation: Adds ~200ms for hypothetical document
  • Large context: More chunks mean longer LLM processing
  • Pro models: Higher quality but slower than Flash/Haiku
  • Live data: MCP API calls add network latency

Optimizing for Speed

For Fastest Responses

  • Use Turbo experiment variant (disables HyDE and reranking)
  • Use Gemini Flash or Claude Haiku
  • Ask simple, focused questions
  • Disable hybrid search for pure semantic queries
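Taken together, a "fastest response" configuration might look like the sketch below. The key names mirror the options discussed in this article but are illustrative; check your deployment's actual configuration schema before using them.

```python
# Hypothetical low-latency settings combining the tips above.
FAST_CONFIG = {
    "experiment_variant": "turbo",   # disables HyDE and reranking
    "model": "gemini-flash",         # or "claude-haiku"
    "hybrid_search": False,          # pure semantic search
    "use_hyde": False,
    "use_reranker": False,
    "top_k": 5,                      # fewer candidates, faster search
}
```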

Speed vs. Quality Trade-offs

Setting             | Speed Impact       | Quality Impact
Turbo variant       | Much faster        | Lower quality
Disable reranking   | 200-500ms faster   | Slightly lower
Disable HyDE        | ~200ms faster      | Worse for vague queries
Flash/Haiku models  | ~1s faster TTFT    | Good but not best
Lower top_k         | Faster search      | May miss relevant content

Monitoring Performance

The Analytics Dashboard provides detailed stage-timing breakdowns, including:

  • Average, P50, and P95 timings for each stage
  • Comparisons across experiment variants
  • Trends over time
  • Breakdown by query complexity
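If you are exporting raw stage durations yourself, average, P50, and P95 figures like those on the dashboard can be computed with the standard library, as in this sketch (the sample timings are made up for illustration):

```python
# Compute average, P50, and P95 per stage from raw durations in milliseconds.

from statistics import mean, quantiles

stage_timings_ms = {
    "assess":   [60, 85, 120, 95, 180],
    "analyze":  [110, 150, 240, 190, 300],
    "search":   [250, 400, 780, 520, 610],
    "generate": [900, 1500, 2800, 1200, 2100],
}

for stage, samples in stage_timings_ms.items():
    pcts = quantiles(samples, n=100)   # pcts[49] ~ P50, pcts[94] ~ P95
    print(f"{stage:>8}: avg={mean(samples):.0f}ms "
          f"p50={pcts[49]:.0f}ms p95={pcts[94]:.0f}ms")
```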
