Vector search with cross-encoder re-ranking, hybrid BM25+vector retrieval, incremental index updates, and multiple LLM backends (Ollama local, OpenAI API).
ssearch
Semantic search over a personal journal archive. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025.
How it works
Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
- Build: Journal entries in ./data are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. Supports incremental updates (new/modified files only) or full rebuilds.
- Retrieve: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
- Re-rank: A cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) scores each (query, chunk) pair jointly and keeps the top 15.
- Synthesize: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
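The overlapping-window idea behind the Build step can be pictured with a minimal sketch. It assumes whitespace-separated words stand in for tokens; the real build chunks with LlamaIndex and a model tokenizer, so actual boundaries will differ. The function name chunk_tokens is hypothetical and not part of this repository.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical 600-"token" entry; whitespace words stand in for real tokens.
words = [f"w{i}" for i in range(600)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])
```

The last 25 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole by at least one chunk.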
Project structure
ssearch/
├── build_exp_claude.py # Build/update vector store (incremental by default)
├── query_topk_prompt_engine_v3.py # Main query engine (cross-encoder re-ranking)
├── query_topk_prompt_engine_v2.py # Previous query engine (no re-ranking)
├── retrieve_raw.py # Verbatim chunk retrieval (no LLM)
├── query_hybrid_bm25_v4.py # Hybrid BM25 + vector query (v4)
├── retrieve_hybrid_raw.py # Hybrid verbatim retrieval (no LLM)
├── search_keywords.py # Keyword search via POS-based term extraction
├── run_query.sh # Shell wrapper with timing and logging
├── data/ # Symlink to ../text/ (journal .txt files)
├── storage_exp/ # Persisted vector store (~242 MB)
├── models/ # Cached HuggingFace models (embedding + cross-encoder, offline)
├── archived/ # Earlier iterations and prototypes
├── saved_output/ # Saved query results and model comparisons
├── requirements.txt # Python dependencies (pip freeze)
├── NOTES.md # Similarity metric reference
├── devlog.txt # Development log and experimental findings
└── *.ipynb # Jupyter notebooks (HyDE, metrics, sandbox)
Setup
Prerequisites: Python 3.12, Ollama with command-r7b pulled.
cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The data/ symlink should point to ../text/ (the journal archive). The embedding model (BAAI/bge-large-en-v1.5) and cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) are cached in ./models/ for offline use.
Offline model loading
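All query scripts set three environment variables before importing any HuggingFace library. HF_HUB_OFFLINE prevents network requests, SENTENCE_TRANSFORMERS_HOME points model loading at the local ./models cache, and TOKENIZERS_PARALLELISM disables tokenizer parallelism (which also silences its fork warning):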
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
These must appear before any imports that touch HuggingFace libraries. The huggingface_hub library evaluates HF_HUB_OFFLINE once at import time (in huggingface_hub/constants.py). If the env var is set after imports, the library will still attempt network access and fail offline. This is a common pitfall -- llama_index.embeddings.huggingface transitively imports huggingface_hub, so even indirect imports trigger the evaluation.
Alternatively, set the variable in your shell before running Python:
export HF_HUB_OFFLINE=1
python query_hybrid_bm25_v4.py "your query"
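One way to make the ordering requirement harder to break is a small guard that runs at the very top of a script. This is a sketch, not code from the repository: configure_offline is a hypothetical helper, and checking sys.modules is an assumed way to detect that huggingface_hub has already been imported.

```python
import os
import sys
import warnings


def configure_offline(models_dir="./models"):
    """Set the offline env vars; return False if HuggingFace was already imported."""
    too_late = "huggingface_hub" in sys.modules
    if too_late:
        # The library has already read HF_HUB_OFFLINE, so setting it now has no effect.
        warnings.warn("huggingface_hub is already imported; HF_HUB_OFFLINE will be ignored")
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    os.environ["SENTENCE_TRANSFORMERS_HOME"] = models_dir
    os.environ["HF_HUB_OFFLINE"] = "1"
    return not too_late


configure_offline()
# Only now import anything that pulls in huggingface_hub, for example:
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
```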
Usage
Build the vector store
# Incremental update (default): only processes new, modified, or deleted files
python build_exp_claude.py
# Full rebuild from scratch
python build_exp_claude.py --rebuild
The default incremental mode loads the existing index, compares file sizes and modification dates against the docstore, and only re-indexes what changed. A full rebuild (--rebuild) is only needed when chunk parameters or the embedding model change.
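The change detection described above amounts to comparing two snapshots of file metadata. The sketch below illustrates that comparison only; how build_exp_claude.py actually reads sizes and dates from the docstore is not shown here, and the names scan and diff_files and the sample values are hypothetical.

```python
from pathlib import Path


def scan(directory):
    """Snapshot {filename: (size, mtime)} for every .txt file in a directory."""
    return {p.name: (p.stat().st_size, p.stat().st_mtime)
            for p in Path(directory).glob("*.txt")}


def diff_files(current, recorded):
    """Compare two snapshots and classify files as new, modified, or deleted."""
    new = sorted(f for f in current if f not in recorded)
    deleted = sorted(f for f in recorded if f not in current)
    modified = sorted(f for f in current
                      if f in recorded and current[f] != recorded[f])
    return {"new": new, "modified": modified, "deleted": deleted}


# Hypothetical state recorded at the last build ...
recorded = {
    "2024-03-15.txt": (1200, 1710500000.0),
    "2023-11-02.txt": (800, 1698900000.0),
}
# ... versus what is on disk now (in practice: current = scan("./data")).
current = {
    "2024-03-15.txt": (1350, 1712000000.0),  # edited since the last build
    "2025-01-01.txt": (400, 1735700000.0),   # new entry
}
changes = diff_files(current, recorded)
print(changes)
```

Only the files listed under new and modified need re-chunking and re-embedding; deleted files only need their chunks removed.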
Search
Three categories of search are available, from heaviest (semantic + LLM) to lightest (grep).
Semantic search with LLM synthesis
These scripts embed the query, retrieve candidate chunks from the vector store, re-rank with a cross-encoder, and pass the top results to a local LLM that synthesizes a grounded answer with file citations. Requires Ollama running with command-r7b.
Vector-only (query_topk_prompt_engine_v3.py): Retrieves the top 30 chunks by cosine similarity, re-ranks to top 15, synthesizes.
python query_topk_prompt_engine_v3.py "What does the author say about creativity?"
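The retrieve-then-re-rank flow can be sketched in plain Python. This is an illustration under stated assumptions, not the script's code: the vectors are toy 2-dimensional values, and word_overlap is a deliberately crude stand-in for the cross-encoder, which in the real pipeline is a neural model scoring each (query, chunk) pair.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, query_text, chunks, rerank_score,
                         top_k=30, top_n=15):
    """chunks: list of (chunk_id, vector, text). Returns [(chunk_id, score)]."""
    # Stage 1: cheap similarity over every stored vector, keep top_k candidates.
    candidates = sorted(chunks, key=lambda c: cosine(query_vec, c[1]),
                        reverse=True)[:top_k]
    # Stage 2: score each (query, chunk) pair jointly, keep top_n.
    scored = [(cid, rerank_score(query_text, text)) for cid, _, text in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_n]


def word_overlap(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


# Made-up chunks for illustration only.
chunks = [
    ("a.txt", [1.0, 0.0], "notes on creativity and writing"),
    ("b.txt", [0.9, 0.1], "creativity"),
    ("c.txt", [0.0, 1.0], "grocery list"),
]
results = retrieve_then_rerank([1.0, 0.0], "creativity and writing", chunks,
                               word_overlap, top_k=2, top_n=1)
print(results)
```

The first stage is cheap enough to run over every stored vector; the second stage is more expensive per pair, which is why it only sees the shortlisted candidates.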
Hybrid BM25 + vector (query_hybrid_bm25_v4.py): Retrieves top 20 by vector similarity and top 20 by BM25 term frequency, merges and deduplicates, re-ranks the union to top 15, synthesizes. Catches exact name/term matches that vector-only retrieval misses.
python query_hybrid_bm25_v4.py "Louis Menand"
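The merge-and-deduplicate step can be sketched as a union keyed by chunk id. The function name merge_candidates and the chunk ids are hypothetical, and the script's own merge logic may differ; the sketch only shows why a chunk found by both retrievers is re-ranked once rather than twice.

```python
def merge_candidates(vector_hits, bm25_hits):
    """Union two ranked lists of chunk ids, recording which retriever found each."""
    merged = {}
    for source, hits in (("vector", vector_hits), ("bm25", bm25_hits)):
        for chunk_id in hits:
            merged.setdefault(chunk_id, set()).add(source)
    return merged


# Made-up chunk ids: "c3" is found by both retrievers.
vector_hits = ["c1", "c2", "c3"]
bm25_hits = ["c3", "c4"]
merged = merge_candidates(vector_hits, bm25_hits)
for chunk_id, sources in merged.items():
    print(chunk_id, "+".join(sorted(sources, reverse=True)))
```

With 20 candidates from each retriever, the union holds between 20 and 40 chunks, depending on how much the two lists overlap.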
Interactive wrapper (run_query.sh): Loops for queries using the v3 engine, displays timing, and appends queries to query.log.
./run_query.sh
Verbatim chunk retrieval (no LLM)
These scripts run the same retrieval and re-ranking pipeline but output the raw chunk text instead of passing it to an LLM. Useful for inspecting what the retrieval pipeline finds, or when Ollama is not available.
Vector-only (retrieve_raw.py): Top-30 vector retrieval, cross-encoder re-rank to top 15, raw output.
python retrieve_raw.py "Kondiaronk and the Wendats"
Hybrid BM25 + vector (retrieve_hybrid_raw.py): Same hybrid retrieval as v4 but outputs raw chunks. Each chunk is annotated with its source: [vector-only], [bm25-only], or [vector+bm25].
python retrieve_hybrid_raw.py "Louis Menand"
Pipe either to less for browsing.
Keyword search (no vector store, no LLM)
search_keywords.py: Extracts nouns and adjectives from the query using NLTK POS tagging, then greps ./data/*.txt for matches with surrounding context. A lightweight fallback for exact string matching that needs neither the vector store nor Ollama.
python search_keywords.py "Discussions of Kondiaronk and the Wendats"
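The grep-with-context half of this search can be sketched without NLTK. The example assumes the terms have already been extracted; the terms list, the sample lines, and the function name grep_with_context are all hypothetical, and the real script's matching rules may differ.

```python
def grep_with_context(lines, terms, context=1):
    """Return (line_number, snippet) for each line containing any term (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    hits = []
    for i, line in enumerate(lines):
        if any(t in line.lower() for t in lowered):
            start, end = max(0, i - context), min(len(lines), i + context + 1)
            hits.append((i + 1, lines[start:end]))
    return hits


# Terms such as a POS tagger might extract from the query (assumed, not real output).
terms = ["Kondiaronk", "Wendats"]
# Made-up journal lines for illustration.
entry = [
    "Read all morning.",
    "The chapter on Kondiaronk was the highlight.",
    "Walked to the river afterwards.",
]
hits = grep_with_context(entry, terms)
print(hits)
```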
Output format
Response:
<LLM-synthesized answer citing specific files>
Source documents:
2024-03-15.txt ./data/2024-03-15.txt 0.683
2023-11-02.txt ./data/2023-11-02.txt 0.651
...
Configuration
Key parameters (set in source files):
| Parameter | Value | Location |
|---|---|---|
| Embedding model | BAAI/bge-large-en-v1.5 | build_exp_claude.py, query_topk_prompt_engine_v3.py |
| Chunk size | 256 tokens | build_exp_claude.py |
| Chunk overlap | 25 tokens | build_exp_claude.py |
| Paragraph separator | \n\n | build_exp_claude.py |
| Initial retrieval | 30 chunks | query_topk_prompt_engine_v3.py |
| Re-rank model | cross-encoder/ms-marco-MiniLM-L-12-v2 | query_topk_prompt_engine_v3.py |
| Re-rank top-n | 15 | query_topk_prompt_engine_v3.py |
| LLM | command-r7b (Ollama) or gpt-4o-mini (OpenAI API) | query_topk_prompt_engine_v3.py, query_hybrid_bm25_v4.py |
| Temperature | 0.3 (recommended for both local and API models) | query_topk_prompt_engine_v3.py, query_hybrid_bm25_v4.py |
| Context window | 8000 tokens | query_topk_prompt_engine_v3.py |
| Request timeout | 360 seconds | query_topk_prompt_engine_v3.py |
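A quick back-of-the-envelope check shows how these values fit together. It assumes chunk sizes measured at build time roughly match what the LLM's tokenizer counts, which is only an approximation since the two tokenizers differ.

```python
CHUNK_TOKENS = 256      # chunk size from the table
RERANK_TOP_N = 15       # chunks passed to the LLM after re-ranking
CONTEXT_WINDOW = 8000   # context window from the table

chunk_budget = CHUNK_TOKENS * RERANK_TOP_N   # tokens taken by retrieved chunks
remaining = CONTEXT_WINDOW - chunk_budget    # left for prompt, query, and answer
print(chunk_budget, remaining)  # 3840 4160
```

Under that assumption, the 15 re-ranked chunks use a little under half the window, leaving the rest for the prompt template, the query, and the generated answer.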
Key dependencies
- llama-index-core (0.14.14) -- RAG framework
- llama-index-embeddings-huggingface (0.6.1) -- embedding integration
- llama-index-llms-ollama (0.9.1) -- local LLM via Ollama
- llama-index-llms-openai (0.6.18) -- OpenAI API LLM (optional, for API-based synthesis)
- llama-index-readers-file (0.5.6) -- file readers
- llama-index-retrievers-bm25 (0.6.5) -- BM25 sparse retrieval for hybrid search
- sentence-transformers (5.1.0) -- embedding model support
- torch (2.8.0) -- ML runtime
Notebooks
Three Jupyter notebooks document exploration and analysis:
- hyde.ipynb -- Experiments with HyDE (Hypothetical Document Embeddings) query rewriting. Tests whether generating a hypothetical answer to a query and embedding that instead improves retrieval. Uses LlamaIndex's HyDEQueryTransform with llama3.1:8B. Finding: the default HyDE prompt produced a rich hypothetical passage, but the technique did not improve retrieval quality over direct prompt engineering. This informed the decision to drop HyDE from the pipeline.
- sandbox.ipynb -- Exploratory notebook for learning the LlamaIndex API. Inspects the llama_index.core module (104 objects), lists available classes and methods, and reads the source of VectorStoreIndex. Useful as a quick reference for what LlamaIndex exposes.
- vs_metrics.ipynb -- Quantitative analysis of the vector store. Loads the persisted index (4,692 vectors, 1024 dimensions each from BAAI/bge-large-en-v1.5) and produces:
  - Distribution of embedding values (histogram)
  - Heatmap of the full embedding matrix
  - Embedding vector magnitude distribution
  - Per-dimension variance (which dimensions carry more signal)
  - Pairwise cosine similarity distribution and heatmap (subset)
  - Hierarchical clustering dendrogram (Ward linkage)
  - PCA and t-SNE 2D projections of the embedding space
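Two of the simpler metrics listed for vs_metrics.ipynb, vector magnitude and per-dimension variance, can be illustrated with the standard library alone. The three 4-dimensional vectors below are made up and stand in for the real 4,692 x 1024 matrix; the notebook itself may well compute these differently.

```python
import math
import statistics

# Made-up vectors; each row is one embedding, each column one dimension.
vectors = [
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.8, 0.0],
]

# Magnitude (L2 norm) of each embedding vector.
magnitudes = [math.sqrt(sum(x * x for x in v)) for v in vectors]

# Population variance of each dimension across all vectors.
per_dim_variance = [statistics.pvariance(col) for col in zip(*vectors)]

print(magnitudes)
print(per_dim_variance)
```

Dimensions with higher variance differ more from one vector to the next, which is the sense in which they "carry more signal" for telling chunks apart.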
Design decisions
- BAAI/bge-large-en-v1.5 over all-mpnet-base-v2: Better semantic matching quality for journal text despite slower embedding.
- 256-token chunks: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
- command-r7b over llama3.1:8B: Sticks closer to provided context with less hallucination at comparable speed.
- Top-k=15: Wide enough to capture diverse perspectives, narrow enough to fit the context window.
- Cross-encoder re-ranking (v3): Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. More accurate than bi-encoder similarity alone. Tested three models; ms-marco-MiniLM-L-12-v2 selected over stsb-roberta-base (wrong task -- semantic similarity, not passage ranking) and BAAI/bge-reranker-v2-m3 (50% slower, weak score tail).
- HyDE query rewriting tested and dropped: Did not improve results over direct prompt engineering.
- V3 prompt: Adapted for re-ranked context -- tells the LLM all excerpts have been curated, encourages examining every chunk and noting what each file contributes. Produces better multi-source synthesis than v2's prompt.
- V2 prompt: More flexible and query-adaptive than v1, which forced rigid structure (exactly 10 files, mandatory theme).
- Verbatim retrieval (retrieve_raw.py): Uses LlamaIndex's index.as_retriever() instead of index.as_query_engine(). The retriever returns raw NodeWithScore objects (chunk text, metadata, scores) without invoking the LLM. The re-ranker is applied manually via reranker.postprocess_nodes(). This separation lets you inspect what the pipeline retrieves before synthesis.
- Keyword search (search_keywords.py): NLTK POS tagging extracts nouns and adjectives from the query -- a middle ground between naive stopword removal and LLM-based term extraction. Catches exact names, places, and dates that vector similarity misses.
- Hybrid BM25 + vector retrieval (v4): Runs two retrievers in parallel -- BM25 (top-20 by term frequency) and vector similarity (top-20 by cosine) -- merges and deduplicates candidates, then lets the cross-encoder re-rank the union to top-15. BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance. Uses BM25Retriever.from_defaults(index=index) from llama-index-retrievers-bm25, which indexes the nodes already stored in the persisted vector store.
Development history
- Aug 2025: Initial implementation -- build pipeline, embedding model comparison, chunk size experiments, HyDE testing, prompt v1.
- Jan 2026: Command-line interface, v2 prompt, error handling improvements, model comparison (command-r7b selected).
- Feb 2026: Project tidy-up, cross-encoder re-ranking (v3), v3 prompt for multi-source synthesis, cross-encoder model comparison (L-12 selected), archived superseded scripts. Hybrid BM25 + vector retrieval (v4). Upgraded LlamaIndex from 0.13.1 to 0.14.14; added OpenAI API as optional LLM backend (llama-index-llms-openai). Incremental vector store updates (default mode in build_exp_claude.py). Fixed offline HuggingFace model loading (env vars must precede imports).
See devlog.txt for detailed development notes and experimental findings.