# LLM Comparison Tests
Query used for all tests: **"Passages that quote Louis Menand."**
Script: `query_hybrid_bm25_v4.py` (hybrid BM25 + vector, cross-encoder re-rank to top 15)
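The fusion step can be sketched roughly as follows. `hybrid_fuse`, the min-max normalization, and the `alpha` weight are illustrative assumptions, not the actual code in `query_hybrid_bm25_v4.py`:

```python
# Sketch of hybrid score fusion: scale BM25 and vector scores to a
# common range, combine with a weight, keep the top_k chunk ids.
def minmax(scores):
    """Min-max scale a {chunk_id: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}

def hybrid_fuse(bm25_scores, vector_scores, alpha=0.5, top_k=15):
    """Weighted sum of normalized scores; returns top_k chunk ids."""
    b, v = minmax(bm25_scores), minmax(vector_scores)
    ids = set(b) | set(v)
    fused = {i: alpha * b.get(i, 0.0) + (1 - alpha) * v.get(i, 0.0) for i in ids}
    return sorted(ids, key=fused.get, reverse=True)[:top_k]

print(hybrid_fuse({"c1": 12.0, "c2": 3.0, "c3": 7.5},
                  {"c1": 0.81, "c2": 0.92, "c3": 0.40}, top_k=2))
# → ['c1', 'c2']
```

In the real script a cross-encoder then re-scores the fused candidates against the query before the top 15 are handed to the LLM.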
Retrieval is identical across all tests (same 15 chunks, same scores).
Only the LLM synthesis step differs.
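The two backends take the temperature setting in slightly different places. A hypothetical payload builder (`build_request` is not part of the script) illustrates the difference:

```python
# Hypothetical payload builder for the two synthesis backends.
# Ollama's /api/generate nests sampling params under "options";
# OpenAI's chat completions API takes temperature at the top level.
def build_request(backend, model, prompt, temperature):
    if backend == "ollama":
        return {"model": model, "prompt": prompt, "stream": False,
                "options": {"temperature": temperature}}
    if backend == "openai":
        return {"model": model, "temperature": temperature,
                "messages": [{"role": "user", "content": prompt}]}
    raise ValueError(f"unknown backend: {backend}")
```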
File naming: `results_<model>_t<temperature>.txt`
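A one-liner matching this convention (`result_path` is a hypothetical helper, not in the script):

```python
def result_path(model, temperature):
    """Build a results filename per the naming scheme above."""
    return f"results_{model}_t{temperature}.txt"

print(result_path("gpt4omini", 0.1))  # → results_gpt4omini_t0.1.txt
```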
## Results
| File | LLM | Temperature | Files cited | Time | Notes |
|------|-----|-------------|-------------|------|-------|
| `results_gpt4omini_t0.1.txt` | gpt-4o-mini (OpenAI API) | 0.1 | 6 | 44s | Broader coverage, structured numbered list, drew from chunks ranked as low as #14 |
| `results_commandr7b_t0.8.txt` | command-r7b (Ollama local) | 0.8 (default) | 2 | 78s | Focused on top chunks, reproduced exact quotes verbatim |
| `results_gpt4omini_t0.3.txt` | gpt-4o-mini (OpenAI API) | 0.3 | 6 | 45s | Very similar to 0.1 run -- same 6 files, same structure, slightly more interpretive phrasing |
| `results_commandr7b_t0.3.txt` | command-r7b (Ollama local) | 0.3 | 6 | 94s | Major improvement over 0.8 default: cited 6 files (was 2), drew from lower-ranked chunks including 2024-08-03 (#15) |
## Observations
- Lowering command-r7b from 0.8 to 0.3 dramatically improved breadth (2 → 6 files cited). At 0.8, the model focused narrowly on the top-scored chunks. At 0.3, it used the full context window much more effectively.
- gpt-4o-mini showed little difference between 0.1 and 0.3. It already used the full context at 0.1. The API model appears less sensitive to temperature for this task.
- command-r7b at 0.3 took longer (94s vs 78s), likely due to generating more text.
- At temperature=0.3, both models converge on similar quality: 6 files cited, good coverage of the context window, and a mix of direct quotes and paraphrases.