Eric e9fc99ddc6 Initial commit: RAG pipeline for semantic search over personal journal archive
Vector search with cross-encoder re-ranking, hybrid BM25+vector retrieval,
incremental index updates, and multiple LLM backends (Ollama local, OpenAI API).
2026-02-20 06:02:28 -05:00

# LLM Comparison Tests

Query used for all tests: "Passages that quote Louis Menand." Script: `query_hybrid_bm25_v4.py` (hybrid BM25 + vector retrieval, cross-encoder re-ranking to top 15).

Retrieval is identical across all tests (same 15 chunks, same scores). Only the LLM synthesis step differs.
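The fusion logic inside `query_hybrid_bm25_v4.py` isn't shown here, but the general shape of hybrid retrieval with re-ranking can be sketched as follows. This is a minimal illustration with hypothetical helper names (`min_max`, `hybrid_rank`, `rerank` are not from the repo); it assumes BM25 and vector scores are fused after min-max normalization, then a cross-encoder-style scorer trims the pool to the top 15.

```python
from typing import Callable, Dict, List, Tuple

def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize scores to [0, 1] so BM25 and cosine scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def hybrid_rank(bm25: Dict[str, float],
                vector: Dict[str, float],
                alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Fuse normalized BM25 and vector scores; alpha weights the vector side.

    Chunks missing from one retriever score 0.0 on that side.
    """
    b, v = min_max(bm25), min_max(vector)
    fused = {cid: alpha * v.get(cid, 0.0) + (1 - alpha) * b.get(cid, 0.0)
             for cid in set(b) | set(v)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 15) -> List[str]:
    """Re-score candidates with a cross-encoder-style callable, keep top_k."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_k]
```

Because the re-ranked pool is fixed before synthesis, every run in the table below sees the same 15 chunks; only the LLM and temperature vary.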

File naming: `results_<model>_t<temperature>.txt`
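If you need to process the result files programmatically, the naming convention above can be parsed with a small helper. This snippet is a sketch, not part of the repo; the regex simply mirrors the `results_<model>_t<temperature>.txt` pattern.

```python
import re
from typing import Tuple

# Mirrors the results_<model>_t<temperature>.txt convention.
NAME_RE = re.compile(r"^results_(?P<model>[^_]+)_t(?P<temp>[0-9]+(?:\.[0-9]+)?)\.txt$")

def parse_result_name(name: str) -> Tuple[str, float]:
    """Return (model, temperature) from a results filename, or raise ValueError."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized results filename: {name}")
    return m.group("model"), float(m.group("temp"))
```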

## Results

| File | LLM | Temperature | Files cited | Time | Notes |
|------|-----|-------------|-------------|------|-------|
| `results_gpt4omini_t0.1.txt` | gpt-4o-mini (OpenAI API) | 0.1 | 6 | 44s | Broader coverage; structured numbered list; drew from chunks ranked as low as #14 |
| `results_commandr7b_t0.8.txt` | command-r7b (Ollama local) | 0.8 (default) | 2 | 78s | Focused on top chunks; reproduced exact quotes verbatim |
| `results_gpt4omini_t0.3.txt` | gpt-4o-mini (OpenAI API) | 0.3 | 6 | 45s | Very similar to the 0.1 run: same 6 files, same structure, slightly more interpretive phrasing |
| `results_commandr7b_t0.3.txt` | command-r7b (Ollama local) | 0.3 | 6 | 94s | Major improvement over the 0.8 default: cited 6 files (was 2); drew from lower-ranked chunks including 2024-08-03 (#15) |

## Observations

- Lowering command-r7b from 0.8 to 0.3 dramatically improved breadth (2 → 6 files cited). At 0.8, the model focused narrowly on the top-scored chunks; at 0.3, it used the full context window much more effectively.
- gpt-4o-mini showed little difference between 0.1 and 0.3: it already used the full context at 0.1. The API model appears less sensitive to temperature for this task.
- command-r7b at 0.3 took longer (94s vs. 78s), likely because it generated more text.
- At temperature 0.3, both models converge on similar quality: 6 files cited, good coverage of the context window, and a mix of direct quotes and paraphrases.
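For reference, the two backends take temperature in different places in their request bodies, which is worth keeping in mind when sweeping this parameter. The sketch below builds the payloads only (no network calls) and is an assumption about how a script like this would call the APIs, not a copy of `query_hybrid_bm25_v4.py`: Ollama's `/api/generate` nests sampling settings under `options`, while OpenAI's chat completions endpoint takes `temperature` at the top level.

```python
def ollama_payload(model: str, prompt: str, temperature: float) -> dict:
    # Ollama's /api/generate puts sampling settings in an "options" object;
    # omitting it silently falls back to the model's default (0.8 here).
    return {"model": model, "prompt": prompt, "stream": False,
            "options": {"temperature": temperature}}

def openai_payload(model: str, prompt: str, temperature: float) -> dict:
    # OpenAI chat completions take "temperature" as a top-level parameter.
    return {"model": model, "temperature": temperature,
            "messages": [{"role": "user", "content": prompt}]}
```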