Semantic search over a directory of documents using local models


Latest commit: 1262129a4f by Eric Furst, 2026-02-26 16:28:44 -05:00 (RAG pipeline for semantic search over personal archives)

ssearch

Semantic search over a journal archive and a collection of clippings (articles, PDFs, web saves). Uses vector embeddings and a local LLM to find and synthesize information across dated journal entries and a library of clippings files.

How it works

Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources

Build: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
Retrieve: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
Re-rank: A cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) scores each (query, chunk) pair jointly and keeps the top 15.
Synthesize: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
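
The retrieve and re-rank stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in query_hybrid.py: `cosine`, `retrieve_then_rerank`, and `toy_pair_score` are hypothetical names, the vectors are toy 2-D values, and the word-overlap scorer only stands in for the cross-encoder to show where joint (query, chunk) scoring slots in.

```python
import math
from typing import Callable, List, Sequence, Tuple

Chunk = Tuple[str, Sequence[float]]  # (chunk text, embedding vector)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    query_text: str,
    chunks: List[Chunk],
    pair_score: Callable[[str, str], float],
    top_k: int = 30,
    top_n: int = 15,
) -> List[str]:
    """Stage 1: top_k by cosine similarity. Stage 2: re-score pairs, keep top_n."""
    candidates = sorted(
        chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True
    )[:top_k]
    reranked = sorted(
        candidates, key=lambda c: pair_score(query_text, c[0]), reverse=True
    )
    return [text for text, _ in reranked[:top_n]]


def toy_pair_score(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: count of shared words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


chunks = [
    ("garden notes on tomatoes", [1.0, 0.0]),
    ("notes on creativity and craft", [0.8, 0.6]),
    ("creativity", [0.6, 0.8]),
]
top = retrieve_then_rerank(
    [0.6, 0.8], "creativity and craft", chunks, toy_pair_score, top_k=2, top_n=1
)
print(top)  # ['notes on creativity and craft']
```

The point of the two stages is visible even in the toy data: the shortest chunk wins on cosine similarity, but the pairwise scorer promotes the chunk that actually covers the query.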

Project structure

ssearch/
├── build_store.py              # Build/update journal vector store (incremental)
├── query_hybrid.py             # Hybrid BM25+vector query with LLM synthesis
├── retrieve.py                 # Verbatim hybrid retrieval (no LLM)
├── search_keywords.py          # Keyword search via POS-based term extraction
├── run_query.sh                # Interactive shell wrapper with timing and logging
├── clippings_search/
│   ├── build_clippings.py      # Build/update clippings vector store (ChromaDB)
│   └── retrieve_clippings.py   # Verbatim clippings chunk retrieval
├── data/                       # Symlink to journal .txt files
├── clippings/                  # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── storage_exp/                # Persisted journal vector store (~242 MB)
├── storage_clippings/          # Persisted clippings vector store (ChromaDB)
├── models/                     # Cached HuggingFace models (offline)
└── requirements.txt            # Python dependencies

Setup

Prerequisites: Python 3.12, Ollama with command-r7b pulled.

cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The data/ symlink should point to the journal archive (plain .txt files). The clippings/ symlink should point to the clippings folder (PDFs, TXT, webarchive, RTF). The embedding model (BAAI/bge-large-en-v1.5) and cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) are cached in ./models/ for offline use.

Offline model loading

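All query scripts set three environment variables so that models load from the local cache without network requests. HF_HUB_OFFLINE is the switch that blocks network access; SENTENCE_TRANSFORMERS_HOME points sentence-transformers at ./models, and TOKENIZERS_PARALLELISM disables tokenizer parallelism: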

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

These assignments must appear before any import that touches a HuggingFace library. The huggingface_hub library evaluates HF_HUB_OFFLINE once at import time (in huggingface_hub/constants.py). If the variable is set after those imports, the library has already read it as unset, so it still attempts network access and fails when the machine is offline.
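
One way to guard against the ordering mistake is to check that no HuggingFace module has been imported yet at the moment the variables are set. The helper below is a hypothetical sketch, not part of the repository: `configure_offline` and `HF_MODULES` are illustrative names, and the list of modules to check is an assumption.

```python
import os
import sys

OFFLINE_ENV = {
    "TOKENIZERS_PARALLELISM": "false",
    "SENTENCE_TRANSFORMERS_HOME": "./models",
    "HF_HUB_OFFLINE": "1",
}

# Assumed list of packages that read these variables when first imported.
HF_MODULES = ("huggingface_hub", "transformers", "sentence_transformers")


def configure_offline(modules=None):
    """Set the offline variables, refusing if a HuggingFace module is already loaded."""
    modules = sys.modules if modules is None else modules
    loaded = [name for name in HF_MODULES if name in modules]
    if loaded:
        raise RuntimeError(
            "offline variables must be set before importing: " + ", ".join(loaded)
        )
    os.environ.update(OFFLINE_ENV)


# Intended use at the very top of a script:
#   configure_offline()
#   from sentence_transformers import CrossEncoder  # import only after this point
```

Failing loudly at startup is easier to diagnose than a network timeout deep inside model loading.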

Usage

Build the vector stores

# Journal index -- incremental update (default)
python build_store.py

# Journal index -- full rebuild
python build_store.py --rebuild

# Clippings index -- incremental update (default)
python clippings_search/build_clippings.py

# Clippings index -- full rebuild
python clippings_search/build_clippings.py --rebuild

The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.
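
The size-and-date comparison can be illustrated with a small snapshot diff. This is a sketch under assumptions, not the logic in build_store.py: how the existing index records file metadata is not described here, so `snapshot` and `diff_snapshots` are hypothetical helpers operating on a plain dictionary.

```python
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict:
    """Map each .txt file (relative path) to its [size, modification time]."""
    result = {}
    for path in sorted(root.rglob("*.txt")):
        if path.is_file():
            stat = path.stat()
            result[str(path.relative_to(root))] = [stat.st_size, stat.st_mtime]
    return result


def diff_snapshots(old: dict, new: dict):
    """Return (added, changed, removed) file lists between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, changed, removed


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first draft")
    before = snapshot(root)
    (root / "a.txt").write_text("first draft, extended")  # size changes
    (root / "b.txt").write_text("new entry")
    added, changed, removed = diff_snapshots(before, snapshot(root))

print(added, changed, removed)
```

Only the files in `added` and `changed` would need re-embedding; entries in `removed` would need their chunks deleted from the store.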

build_clippings.py handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing -- those without extractable text (scanned, encrypted) are skipped and written to ocr_needed.txt for later OCR processing.
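
The text check behind PDF validation might look like the sketch below. The design notes in this README give a 100-character minimum and a "low printable ratio" test; the 0.8 ratio threshold and the function names here are assumptions for illustration. Text extraction itself (for example with pypdf's `PdfReader` and `extract_text()`) is left out so the example runs without any PDF files.

```python
def has_usable_text(
    text: str, min_chars: int = 100, min_printable_ratio: float = 0.8
) -> bool:
    """Return True if extracted text looks long and clean enough to index.

    min_printable_ratio is an assumed threshold, not a value from the project.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(1 for ch in stripped if ch.isprintable() or ch in "\n\t")
    return printable / len(stripped) >= min_printable_ratio


def partition_by_text(extracted: dict):
    """Split {filename: extracted text} into (indexable, ocr_needed) name lists."""
    indexable, ocr_needed = [], []
    for name, text in extracted.items():
        (indexable if has_usable_text(text) else ocr_needed).append(name)
    return indexable, ocr_needed


indexable, ocr_needed = partition_by_text(
    {"essay.pdf": "word " * 50, "scan.pdf": "", "garbled.pdf": "\x00" * 200}
)
print(indexable)   # ['essay.pdf']
print(ocr_needed)  # ['scan.pdf', 'garbled.pdf']
```

The names in `ocr_needed` correspond to what would be written to ocr_needed.txt.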

Search journals

Semantic search with LLM synthesis

Requires Ollama running with command-r7b.

Hybrid BM25 + vector (query_hybrid.py): Retrieves the top 20 chunks by vector similarity and the top 20 by BM25 term frequency, merges and deduplicates the two lists, re-ranks the union with the cross-encoder, keeps the top 15, and passes them to the LLM for synthesis.

python query_hybrid.py "What does the author say about creativity?"
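
For readers unfamiliar with BM25, the scoring idea can be sketched in a few lines. The project relies on llama-index-retrievers-bm25 for this; the function below is not that implementation, only a minimal Okapi BM25 with whitespace tokenization, and the `k1` and `b` defaults are conventional values assumed for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avgdl = sum(len(toks) for toks in tokenized) / n_docs or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "Kondiaronk spoke to the Wendats",
    "notes on creativity",
    "the garden",
]
scores = bm25_scores("Kondiaronk Wendats", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 0
```

Rare proper nouns such as the ones in this query get a high inverse document frequency, which is why BM25 can surface exact-term matches that an embedding model may rank lower.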

Interactive wrapper (run_query.sh): Prompts for queries in a loop, displays timing for each, and appends each query to query.log.

./run_query.sh

Verbatim chunk retrieval (no LLM)

Same hybrid retrieval and re-ranking pipeline but outputs raw chunk text. Each chunk is annotated with its source: [vector-only], [bm25-only], or [vector+bm25]. No Ollama needed.

python retrieve.py "Kondiaronk and the Wendats"
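
The merge step and the source annotations can be sketched as follows. This assumes each chunk has a stable identifier and that the two retrievers return ranked lists of those identifiers; `merge_candidates` is a hypothetical name, not a function from retrieve.py.

```python
from typing import List, Tuple


def merge_candidates(vector_ids: List[str], bm25_ids: List[str]) -> List[Tuple[str, str]]:
    """Union two ranked ID lists, dropping duplicates and labelling each ID's origin."""
    vector_set, bm25_set = set(vector_ids), set(bm25_ids)
    merged, seen = [], set()
    for chunk_id in list(vector_ids) + list(bm25_ids):
        if chunk_id in seen:
            continue
        seen.add(chunk_id)
        if chunk_id in vector_set and chunk_id in bm25_set:
            label = "vector+bm25"
        elif chunk_id in vector_set:
            label = "vector-only"
        else:
            label = "bm25-only"
        merged.append((chunk_id, label))
    return merged


merged = merge_candidates(["a", "b"], ["b", "c"])
print(merged)  # [('a', 'vector-only'), ('b', 'vector+bm25'), ('c', 'bm25-only')]
```

The order of the merged list does not matter much, because the cross-encoder re-scores every candidate afterwards.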

Keyword search (no vector store, no LLM)

Extracts nouns and adjectives from the query using NLTK POS tagging, then greps the journal files for matches with surrounding context.

python search_keywords.py "Discussions of Kondiaronk and the Wendats"
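
The grep-with-context half of this search can be sketched with the standard library. Term extraction by POS tagging is omitted here, so the terms are passed in directly; `grep_with_context`, the one-line context window, and the sample file name are assumptions for illustration rather than details of search_keywords.py.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple


def grep_with_context(root: Path, terms: List[str], context: int = 1) -> List[Tuple[str, int, List[str]]]:
    """Return (file name, 1-based line number, surrounding lines) for each matching line."""
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).glob("*.txt")):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for i, line in enumerate(lines):
            if pattern.search(line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                hits.append((path.name, i + 1, lines[lo:hi]))
    return hits


# Small demonstration with a temporary journal file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "2024-01-01.txt").write_text(
        "Morning walk.\nRead about Kondiaronk today.\nDinner with friends.\n",
        encoding="utf-8",
    )
    hits = grep_with_context(Path(tmp), ["Kondiaronk", "Wendats"])

print(hits)
```

Because the terms are escaped and joined into one pattern, each file is scanned once regardless of how many terms the query produces.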

Search clippings

Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then the full chunk text. Includes page numbers for PDF sources. No Ollama needed.

python clippings_search/retrieve_clippings.py "creativity and innovation"
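
A per-file summary of ranked results could be built along these lines. The actual output format of retrieve_clippings.py is not shown in this README, so the dictionary layout, the `file` and `page` keys, and the name `summarize_sources` are all hypothetical.

```python
from typing import Dict, List


def summarize_sources(ranked_chunks: List[dict]) -> Dict[str, dict]:
    """Group ranked chunks by source file, tracking best rank, count, and PDF pages."""
    summary: Dict[str, dict] = {}
    for rank, chunk in enumerate(ranked_chunks, start=1):
        entry = summary.setdefault(
            chunk["file"], {"best_rank": rank, "count": 0, "pages": []}
        )
        entry["count"] += 1
        page = chunk.get("page")  # only PDF chunks are assumed to carry a page
        if page is not None and page not in entry["pages"]:
            entry["pages"].append(page)
    return summary


ranked = [
    {"file": "essay.pdf", "page": 3},
    {"file": "notes.txt"},
    {"file": "essay.pdf", "page": 7},
]
summary = summarize_sources(ranked)
print(summary)
```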

Configuration

Key parameters (set in source files):

| Parameter | Value | Location |
|---|---|---|
| Embedding model | `BAAI/bge-large-en-v1.5` | all build and query scripts |
| Chunk size | 256 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Chunk overlap | 25 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Initial retrieval | 30 chunks | query and retrieve scripts |
| Re-rank model | `cross-encoder/ms-marco-MiniLM-L-12-v2` | query and retrieve scripts |
| Re-rank top-n | 15 | query and retrieve scripts |
| LLM | `command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API) | `query_hybrid.py` |
| Temperature | 0.3 | `query_hybrid.py` |
| Context window | 8000 tokens | `query_hybrid.py` |
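
To make the chunk size and overlap values concrete: with 256-token chunks and a 25-token overlap, each new chunk starts 231 tokens after the previous one. The sliding window below is a simplified sketch that treats the input as a flat token list; `chunk_tokens` is a hypothetical helper, and LlamaIndex's own splitters may additionally respect sentence boundaries, which this does not model.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 256, overlap: int = 25) -> List[list]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 231 with the values in the table
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


chunks = chunk_tokens(list(range(600)))
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [256, 256, 138]
print(chunks[1][0])              # 231
```

Changing either parameter shifts every chunk boundary, which is why a full rebuild is needed when they change.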

Key dependencies

llama-index-core (0.14.14) -- RAG framework
llama-index-embeddings-huggingface -- embedding integration
llama-index-vector-stores-chroma -- ChromaDB vector store for clippings
llama-index-llms-ollama -- local LLM via Ollama
llama-index-llms-openai -- OpenAI API LLM (optional)
llama-index-retrievers-bm25 -- BM25 sparse retrieval for hybrid search
chromadb -- persistent vector store for clippings index
sentence-transformers -- cross-encoder re-ranking
torch -- ML runtime

Design decisions

BAAI/bge-large-en-v1.5 over all-mpnet-base-v2: Better semantic matching quality for journal text despite slower embedding.
256-token chunks: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
command-r7b over llama3.1:8B: Sticks closer to provided context with less hallucination at comparable speed.
Cross-encoder re-ranking: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; ms-marco-MiniLM-L-12-v2 selected over stsb-roberta-base (wrong task) and BAAI/bge-reranker-v2-m3 (slower, weak score tail).
Hybrid BM25 + vector retrieval: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
ChromaDB for clippings: Persistent SQLite-backed store. Chosen over the JSON store used for journals because the clippings index handles more diverse file types and benefits from ChromaDB's metadata filtering and direct chunk-level operations for incremental updates.
PDF validation before indexing: Pre-check each PDF with pypdf -- skip if text extraction yields <100 chars or low printable ratio. Skipped files are written to ocr_needed.txt for later OCR processing.