When you ask a question, it passes through several processing stages before a response is generated. Understanding these stages helps explain why some queries take longer than others and how to optimize for speed.
**Complexity classification.** The system first classifies your query as simple, medium, or complex and adjusts retrieval parameters accordingly: simple queries fetch fewer candidates, which yields faster responses.
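As a rough illustration, the mapping from complexity tier to retrieval parameters might look like the sketch below. The tier cutoffs, parameter names, and values here are assumptions for illustration, not the system's actual configuration.

```python
# Hypothetical complexity-to-parameter mapping; the tiers and values
# are illustrative assumptions, not the system's real configuration.
RETRIEVAL_PARAMS = {
    "simple":  {"top_k": 5,  "rerank": False},
    "medium":  {"top_k": 15, "rerank": True},
    "complex": {"top_k": 40, "rerank": True},
}

def params_for(query: str) -> dict:
    # Crude heuristic stand-in for the real classifier: longer,
    # multi-clause queries are treated as more complex.
    words = len(query.split())
    tier = "simple" if words < 8 else "medium" if words < 20 else "complex"
    return RETRIEVAL_PARAMS[tier]
```

Under this heuristic, a short question like "What is our refund policy?" would land in the simple tier and search far fewer candidates than a multi-part analytical query.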
**Intent and entity detection.** The query is also analyzed to detect its intent (factual, comparison, temporal, exploratory) and to extract entities such as products, tickers, and dates. The detected intent and entities determine the relevance thresholds and filter settings used during retrieval.
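A minimal sketch of how intent could drive thresholds and filters, assuming a simple per-intent lookup; the threshold values and filter keys are illustrative, not the real settings.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalConfig:
    relevance_threshold: float
    filters: dict = field(default_factory=dict)

def config_for(intent: str, entities: dict) -> RetrievalConfig:
    if intent == "temporal":
        # Temporal queries filter on extracted dates and tolerate
        # slightly lower semantic similarity (values are assumptions).
        return RetrievalConfig(0.60, {"date_range": entities.get("dates")})
    if intent == "comparison":
        # Comparisons should cover every mentioned product or ticker.
        return RetrievalConfig(0.70, {"must_match": entities.get("products", [])})
    # Factual and exploratory queries use a stricter default threshold.
    return RetrievalConfig(0.75)
```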
**Retrieval.** The query is converted to an embedding vector and searched against the knowledge base. Candidate chunks are filtered by relevance score, length, and entity overlap, then optionally reranked.
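Putting the retrieval stage together, a hedged sketch might look like this; `store.search`, the chunk interface, and the cutoff values are all assumptions standing in for the real index and tuning.

```python
def entity_overlap(chunk, filters):
    # Pass chunks that mention at least one required entity; no-op when
    # no entity filter is set. 'must_match' is a hypothetical key.
    required = filters.get("must_match") or []
    return not required or any(e.lower() in chunk.text.lower() for e in required)

def retrieve(query_vec, store, cfg, top_k=15, rerank_fn=None):
    # 'store.search' stands in for whatever vector index is used; it is
    # assumed to return (chunk, similarity_score) pairs, best first.
    candidates = store.search(query_vec, top_k=top_k)

    # Drop candidates that fail the relevance, length, or entity checks.
    kept = [
        (chunk, score) for chunk, score in candidates
        if score >= cfg.relevance_threshold
        and len(chunk.text) >= 40
        and entity_overlap(chunk, cfg.filters)
    ]

    # Optional second pass: rerank survivors with a stronger scorer
    # (e.g. a cross-encoder) at the cost of extra latency.
    if rerank_fn is not None:
        kept.sort(key=lambda pair: rerank_fn(pair[0]), reverse=True)
    return kept
```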
**Generation.** Retrieved chunks are passed to the LLM together with your query and the conversation history. The model generates a response that streams back to you in real time.
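A sketch of the generation step under the same assumptions: `llm.stream` is a placeholder for whatever streaming API the system actually calls, and the prompt template is illustrative.

```python
def answer(llm, query, history, chunks):
    # Assemble a grounded prompt; this template is an assumption, not
    # the system's actual prompt.
    context = "\n\n".join(chunk.text for chunk, _ in chunks)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )
    # Tokens are forwarded to the caller as soon as they arrive, which
    # is what makes the response appear to stream in real time.
    for token in llm.stream(prompt):
        yield token
```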
Several settings let you trade response quality for speed:

| Setting | Speed Impact | Quality Impact |
|---|---|---|
| Turbo variant | Much faster | Lower quality |
| Disable reranking | 200-500ms faster | Slightly lower |
| Disable HyDE | ~200ms faster | Worse for vague queries |
| Flash/Haiku models | ~1s faster time to first token (TTFT) | Good, but below the largest models |
| Lower top_k | Faster search | May miss relevant content |
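For example, a latency-first configuration might combine several of the rows above. The key names and preset below are illustrative assumptions, not the system's actual configuration schema:

```python
# A latency-first preset combining rows from the table above; every key
# and value here is an illustrative assumption, not the actual schema.
SPEED_PRESET = {
    "model": "turbo",   # faster variant, lower answer quality
    "rerank": False,    # saves roughly 200-500 ms per query
    "use_hyde": False,  # saves ~200 ms, but hurts vague queries
    "top_k": 5,         # smaller candidate set; may miss relevant content
}
```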
The Analytics Dashboard provides a detailed timing breakdown for each stage, including: