Semantic search over a directory of documents using local models


Eric Furst 13785d667a Rename storage_exp/ to store/, remove unused storage/ Update all active scripts, .gitignore, CLAUDE.md, and README.md. Also fix stale filename references in script header comments.		2026-02-26 16:36:57 -05:00
archived	Reorganize project: rename scripts, archive superseded, add clippings_search/	2026-02-26 16:24:32 -05:00
clippings_search	Reorganize project: rename scripts, archive superseded, add clippings_search/	2026-02-26 16:24:32 -05:00
saved_output	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
tests	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
.gitignore	Rename storage_exp/ to store/, remove unused storage/	2026-02-26 16:36:57 -05:00
build_store.py	Rename storage_exp/ to store/, remove unused storage/	2026-02-26 16:36:57 -05:00
devlog.txt	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
hyde.ipynb	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
NOTES.md	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
query_hybrid.py	Rename storage_exp/ to store/, remove unused storage/	2026-02-26 16:36:57 -05:00
query_topk_prompt_engine_v2.py	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
README.md	Rename storage_exp/ to store/, remove unused storage/	2026-02-26 16:36:57 -05:00
requirements.txt	Built semantic search over clippings files.	2026-02-22 07:48:48 -05:00
retrieve.py	Rename storage_exp/ to store/, remove unused storage/	2026-02-26 16:36:57 -05:00
run_query.sh	Reorganize project: rename scripts, archive superseded, add clippings_search/	2026-02-26 16:24:32 -05:00
sandbox.ipynb	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
search_keywords.py	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00
vs_metrics.ipynb	Initial commit: RAG pipeline for semantic search over personal journal archive	2026-02-20 06:02:28 -05:00

README.md

ssearch

Semantic search over a personal journal archive and a collection of clippings. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025, plus a library of PDFs, articles, and web saves.

How it works

Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources

Build: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
Retrieve: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
Re-rank: A cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) scores each (query, chunk) pair jointly and keeps the top 15.
Synthesize: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
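
The retrieve and re-rank stages above can be sketched in plain Python. This is a minimal illustration, not the code in this repository: `retrieve_then_rerank` and `word_overlap` are hypothetical names, the vectors are toy values, and `word_overlap` only stands in for the cross-encoder, which scores each (query, chunk) pair with a neural model.

```python
import math
from typing import Callable

Chunk = tuple[str, str, list[float]]  # (chunk_id, text, embedding)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: list[float],
    query_text: str,
    chunks: list[Chunk],
    rerank_score: Callable[[str, str], float],
    top_k: int = 30,
    top_n: int = 15,
) -> list[str]:
    """Stage 1: keep top_k chunks by cosine similarity. Stage 2: re-score and keep top_n."""
    candidates = sorted(chunks, key=lambda c: cosine(query_vec, c[2]), reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda c: rerank_score(query_text, c[1]), reverse=True)
    return [chunk_id for chunk_id, _, _ in reranked[:top_n]]


def word_overlap(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0
```

The point of the two stages is that the cheap similarity search only nominates candidates; the slower pairwise scorer decides the final order, so a chunk ranked second by cosine similarity can still come out first.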

Project structure

ssearch/
├── build_store.py              # Build/update journal vector store (incremental)
├── query_hybrid.py             # Hybrid BM25+vector query with LLM synthesis
├── retrieve.py                 # Verbatim hybrid retrieval (no LLM)
├── search_keywords.py          # Keyword search via POS-based term extraction
├── run_query.sh                # Interactive shell wrapper with timing and logging
├── clippings_search/
│   ├── build_clippings.py      # Build/update clippings vector store (ChromaDB)
│   └── retrieve_clippings.py   # Verbatim clippings chunk retrieval
├── data/                       # Symlink to journal .txt files
├── clippings/                  # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── store/                      # Persisted journal vector store (~242 MB)
├── storage_clippings/          # Persisted clippings vector store (ChromaDB)
├── models/                     # Cached HuggingFace models (offline)
├── archived/                   # Superseded script versions
├── saved_output/               # Saved query results and model comparisons
├── requirements.txt            # Python dependencies
├── devlog.txt                  # Development log and experimental findings
└── *.ipynb                     # Jupyter notebooks (HyDE, metrics, sandbox)

Setup

Prerequisites: Python 3.12, Ollama with command-r7b pulled.

cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The data/ symlink should point to the journal archive (plain .txt files). The clippings/ symlink should point to the clippings folder. The embedding model (BAAI/bge-large-en-v1.5) and cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) are cached in ./models/ for offline use.

Offline model loading

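All query scripts set three environment variables before loading models. HF_HUB_OFFLINE=1 prevents HuggingFace from making network requests, SENTENCE_TRANSFORMERS_HOME points model loading at the local ./models cache, and TOKENIZERS_PARALLELISM=false disables tokenizer parallelism and its fork warning: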

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

These assignments must appear before any import that touches HuggingFace libraries. The huggingface_hub library evaluates HF_HUB_OFFLINE once at import time (in huggingface_hub/constants.py), so if the variable is set after the import, the library still attempts network access and fails when no connection is available.

Alternatively, set the variable in your shell before running Python:

export HF_HUB_OFFLINE=1
python query_hybrid.py "your query"
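
Because the ordering is easy to get wrong, a script can check it at startup. The sketch below is an assumption about how one might guard against the mistake, not code from this repository; `set_offline_env` is a hypothetical helper. It sets the three variables and reports whether huggingface_hub had already been imported, in which case the offline flag would come too late.

```python
import os
import sys


def set_offline_env(models_dir: str = "./models") -> bool:
    """Set the offline variables; return False if huggingface_hub was imported first."""
    already_imported = "huggingface_hub" in sys.modules
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    os.environ["SENTENCE_TRANSFORMERS_HOME"] = models_dir
    os.environ["HF_HUB_OFFLINE"] = "1"
    return not already_imported


if not set_offline_env():
    print(
        "warning: huggingface_hub was imported before HF_HUB_OFFLINE was set",
        file=sys.stderr,
    )

# Imports that touch HuggingFace libraries belong below this line.
```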

Usage

Build the vector stores

# Journal index -- incremental update (default)
python build_store.py

# Journal index -- full rebuild
python build_store.py --rebuild

# Clippings index -- incremental update (default)
python clippings_search/build_clippings.py

# Clippings index -- full rebuild
python clippings_search/build_clippings.py --rebuild

The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.
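
The size-and-date comparison can be sketched as follows. This is an illustration under assumptions: `file_signature` and `diff_against_index` are hypothetical names, and the `indexed` dictionary stands in for whatever per-file metadata the real index stores.

```python
from pathlib import Path

Signature = tuple[int, float]  # (size in bytes, modification time)


def file_signature(path: Path) -> Signature:
    """Size and mtime, used as a cheap proxy for 'has this file changed?'."""
    st = path.stat()
    return (st.st_size, st.st_mtime)


def diff_against_index(
    data_dir: Path, indexed: dict[str, Signature]
) -> tuple[list[str], list[str], list[str]]:
    """Return (new, changed, deleted) file names relative to the indexed signatures."""
    current = {p.name: file_signature(p) for p in sorted(data_dir.glob("*.txt"))}
    new = [name for name in current if name not in indexed]
    changed = [name for name in current if name in indexed and current[name] != indexed[name]]
    deleted = [name for name in indexed if name not in current]
    return new, changed, deleted
```

Only the files in `new` and `changed` would need re-embedding, and chunks from `deleted` files would be dropped. A size/mtime check is fast but can miss an edit that preserves both values; a content hash would be stricter at the cost of reading every file.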

build_clippings.py handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing; those without extractable text are skipped and listed in ocr_needed.txt for later OCR.
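
A check of this kind might look like the sketch below. It operates on text that has already been extracted (for example with pypdf's `PdfReader(path).pages[i].extract_text()`), so it runs without any PDF library. `looks_extractable` is a hypothetical name. The 100-character minimum matches the threshold given under Design decisions; the 0.8 printable ratio is an assumed value, since the README only says "low printable ratio".

```python
def looks_extractable(
    text: str,
    min_chars: int = 100,               # threshold stated in this README
    min_printable_ratio: float = 0.8,   # assumed value, for illustration only
) -> bool:
    """Return True if extracted text looks like real prose rather than scan debris."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(1 for ch in stripped if ch.isprintable() or ch.isspace())
    return printable / len(stripped) >= min_printable_ratio
```

A file that fails the check would be skipped and its path appended to ocr_needed.txt.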

Search journals

Semantic search with LLM synthesis

Requires Ollama running with command-r7b.

Hybrid BM25 + vector (query_hybrid.py): Retrieves the top 20 chunks by vector similarity and the top 20 by BM25 term frequency, merges and deduplicates them, re-ranks the union down to the top 15 with the cross-encoder, and synthesizes an answer. This catches exact name and term matches that vector-only retrieval misses.

python query_hybrid.py "What does the author say about creativity?"
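
The merge-and-deduplicate step can be sketched as below. This is a minimal illustration, not the repository's implementation: `merge_candidates` is a hypothetical name and chunks are represented by plain string IDs. It also records which retriever nominated each chunk, using the same three source tags that retrieve.py prints.

```python
def merge_candidates(vector_hits: list[str], bm25_hits: list[str]) -> list[tuple[str, str]]:
    """Union two ranked ID lists, dropping duplicates and tagging each chunk's origin."""
    vector_set, bm25_set = set(vector_hits), set(bm25_hits)
    merged: list[tuple[str, str]] = []
    seen: set[str] = set()
    for chunk_id in vector_hits + bm25_hits:
        if chunk_id in seen:
            continue
        seen.add(chunk_id)
        if chunk_id in vector_set and chunk_id in bm25_set:
            tag = "vector+bm25"
        elif chunk_id in vector_set:
            tag = "vector-only"
        else:
            tag = "bm25-only"
        merged.append((chunk_id, tag))
    return merged
```

The order of the merged list does not matter much, because the cross-encoder re-scores every candidate afterwards.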

Interactive wrapper (run_query.sh): Prompts for queries in a loop, displays the time each query takes, and appends each query to query.log.

./run_query.sh

Verbatim chunk retrieval (no LLM)

Runs the same hybrid retrieval and re-ranking pipeline as query_hybrid.py but outputs raw chunk text instead of a synthesized answer. Each chunk is annotated with its source: [vector-only], [bm25-only], or [vector+bm25]. No Ollama needed.

python retrieve.py "Kondiaronk and the Wendats"

Keyword search (no vector store, no LLM)

Extracts nouns and adjectives from the query using NLTK POS tagging, then greps journal files for matches with surrounding context.

python search_keywords.py "Discussions of Kondiaronk and the Wendats"
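
The grep-with-context half of this search can be sketched without NLTK. The example assumes the keywords have already been extracted from the query (the POS-tagging step is omitted), and `grep_with_context` is a hypothetical name rather than a function in search_keywords.py.

```python
def grep_with_context(
    lines: list[str], keywords: list[str], context: int = 1
) -> list[tuple[int, list[str]]]:
    """Return (1-based line number, surrounding lines) for each case-insensitive match."""
    lowered = [k.lower() for k in keywords]
    results = []
    for i, line in enumerate(lines):
        if any(k in line.lower() for k in lowered):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            results.append((i + 1, lines[start:end]))
    return results
```

For the query above, the extracted terms would plausibly include "Kondiaronk" and "Wendats", and each journal file's lines would be passed through this function.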

Search clippings

Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then full chunk text. Includes page numbers for PDF sources. No Ollama needed.

python clippings_search/retrieve_clippings.py "creativity and innovation"

Output format

Response:
<LLM-synthesized answer citing specific files>

Source documents:
2024-03-15.txt  ./data/2024-03-15.txt  0.683
2023-11-02.txt  ./data/2023-11-02.txt  0.651
...

Configuration

Key parameters (set in source files):

Parameter	Value	Location
Embedding model	`BAAI/bge-large-en-v1.5`	all build and query scripts
Chunk size	256 tokens	`build_store.py`, `clippings_search/build_clippings.py`
Chunk overlap	25 tokens	`build_store.py`, `clippings_search/build_clippings.py`
Paragraph separator	`\n\n`	`build_store.py`
Initial retrieval	30 chunks	query and retrieve scripts
Re-rank model	`cross-encoder/ms-marco-MiniLM-L-12-v2`	query and retrieve scripts
Re-rank top-n	15	query and retrieve scripts
LLM	`command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API)	`query_hybrid.py`
Temperature	0.3	`query_hybrid.py`
Context window	8000 tokens	`query_hybrid.py`
Request timeout	360 seconds	`query_hybrid.py`

Key dependencies

llama-index-core (0.14.14) -- RAG framework
llama-index-embeddings-huggingface -- embedding integration
llama-index-vector-stores-chroma -- ChromaDB vector store for clippings
llama-index-llms-ollama -- local LLM via Ollama
llama-index-llms-openai -- OpenAI API LLM (optional)
llama-index-retrievers-bm25 -- BM25 sparse retrieval for hybrid search
chromadb -- persistent vector store for clippings index
sentence-transformers -- cross-encoder re-ranking
torch -- ML runtime

Notebooks

Three Jupyter notebooks document exploration and analysis:

hyde.ipynb -- Experiments with HyDE (Hypothetical Document Embeddings) query rewriting. Finding: did not improve retrieval quality over direct prompt engineering.
sandbox.ipynb -- Exploratory notebook for learning the LlamaIndex API.
vs_metrics.ipynb -- Quantitative analysis of the vector store (embedding distributions, pairwise similarity, clustering, PCA/t-SNE projections).

Design decisions

BAAI/bge-large-en-v1.5 over all-mpnet-base-v2: Better semantic matching quality for journal text despite slower embedding.
256-token chunks: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
command-r7b over llama3.1:8B: Sticks closer to provided context with less hallucination at comparable speed.
Cross-encoder re-ranking: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; ms-marco-MiniLM-L-12-v2 selected over stsb-roberta-base (wrong task) and BAAI/bge-reranker-v2-m3 (slower, weak score tail).
HyDE query rewriting tested and dropped: Did not improve results over direct prompt engineering.
Hybrid BM25 + vector retrieval: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
ChromaDB for clippings: Persistent SQLite-backed store. Chosen over the JSON store for its metadata filtering and direct chunk-level operations for incremental updates.
PDF validation before indexing: Each PDF is pre-checked with pypdf and skipped if text extraction yields fewer than 100 characters or a low ratio of printable characters. Skipped files are written to ocr_needed.txt.
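
To make the chunk-size decision above concrete, here is a minimal sliding-window sketch of 256-token chunks with a 25-token overlap. It is an illustration only: `chunk_tokens` is a hypothetical name, tokens are treated as an already-split list, and LlamaIndex's own splitter additionally respects sentence and paragraph boundaries, which this sketch ignores.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 256, overlap: int = 25) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

With these defaults each window advances by 231 tokens, so a 600-token entry yields three chunks, and the last 25 tokens of one chunk reappear at the start of the next.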

Development history

Aug 2025: Initial implementation -- build pipeline, embedding model comparison, chunk size experiments, HyDE testing.
Jan 2026: Command-line interface, prompt improvements, model comparison (command-r7b selected).
Feb 2026: Cross-encoder re-ranking, hybrid BM25+vector retrieval, LlamaIndex upgrade to 0.14.14, OpenAI API backend, incremental updates, clippings search (ChromaDB), project reorganization.

See devlog.txt for detailed development notes and experimental findings.