# ssearch

Semantic search over a personal journal archive and a collection of clippings. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025, plus a library of PDFs, articles, and web saves.

## How it works

```
Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30)
      → Cross-encoder re-rank (top-15)
      → LLM synthesis (command-r7b via Ollama, or OpenAI API)
      → Response + sources
```

1. **Build**: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
2. **Retrieve**: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
3. **Re-rank**: A cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) scores each (query, chunk) pair jointly and keeps the top 15.
4. **Synthesize**: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.

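
The Retrieve and Re-rank steps above can be pictured with a small, dependency-free sketch. It is not the code in this repository: the `store` dictionary stands in for the vector store, the two-dimensional vectors are made-up values, and `word_overlap` is a toy placeholder for the cross-encoder. All function names here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, store, top_k=30):
    """Return the top_k chunks whose vectors are most similar to the query.

    `store` maps chunk text -> embedding vector (stand-in for the vector store).
    """
    ranked = sorted(store, key=lambda c: cosine(query_vec, store[c]), reverse=True)
    return ranked[:top_k]


def rerank(query, candidates, score_pair, top_n=15):
    """Keep the top_n candidates according to a joint (query, chunk) scorer.

    `score_pair` stands in for the cross-encoder; any callable returning
    a number works for the purposes of this sketch.
    """
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:top_n]


def word_overlap(query, chunk):
    """Toy scorer (assumption): count of words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


# Made-up 2-d "embeddings" for three chunks.
store = {
    "notes on creativity": [0.9, 0.1],
    "grocery list": [0.1, 0.9],
    "creative writing habits": [0.8, 0.3],
}

candidates = retrieve([1.0, 0.0], store, top_k=2)
best = rerank("creativity", candidates, word_overlap, top_n=1)
print(candidates)
print(best)
```

The two stages are kept separate because the first pass scores the query and each chunk independently and is cheap, while the second pass scores every pair jointly and is only affordable on a short candidate list.
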
## Project structure

```
ssearch/
├── build_store.py                  # Build/update journal vector store (incremental)
├── query_hybrid.py                 # Hybrid BM25+vector query with LLM synthesis
├── retrieve.py                     # Verbatim hybrid retrieval (no LLM)
├── search_keywords.py              # Keyword search via POS-based term extraction
├── run_query.sh                    # Interactive shell wrapper with timing and logging
├── clippings_search/
│   ├── build_clippings.py          # Build/update clippings vector store (ChromaDB)
│   ├── retrieve_clippings.py       # Verbatim clippings chunk retrieval
│   └── store_clippings/            # Persisted clippings vector store (ChromaDB)
├── data/                           # Symlink to journal .txt files
├── clippings/                      # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── store/                          # Persisted journal vector store
├── models/                         # Cached HuggingFace models (offline)
├── archived/                       # Superseded script versions
├── saved_output/                   # Saved query results and model comparisons
├── requirements.txt                # Python dependencies
├── devlog.md                       # Development log and experimental findings
└── *.ipynb                         # Jupyter notebooks (HyDE, metrics, sandbox)
```

## Setup

**Prerequisites**: Python 3.12, [Ollama](https://ollama.com) with `command-r7b` pulled.

```bash
cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The `data/` symlink should point to the journal archive (plain `.txt` files). The `clippings/` symlink should point to the clippings folder. The embedding model (`BAAI/bge-large-en-v1.5`) and cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) are cached in `./models/` for offline use.

### Offline model loading

All query scripts set three environment variables to prevent HuggingFace from making network requests:

```python
import os

# Must run before importing any HuggingFace-backed library
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
```

**These must appear before any imports that touch HuggingFace libraries.** The `huggingface_hub` library evaluates `HF_HUB_OFFLINE` once at import time (in `huggingface_hub/constants.py`). If the env var is set after imports, the library will still attempt network access and fail offline.

Alternatively, set the variable in your shell before running Python:

```bash
export HF_HUB_OFFLINE=1
python query_hybrid.py "your query"
```

## Usage

### Build the vector stores

```bash
# Journal index -- incremental update (default)
python build_store.py

# Journal index -- full rebuild
python build_store.py --rebuild

# Clippings index -- incremental update (default)
python clippings_search/build_clippings.py

# Clippings index -- full rebuild
python clippings_search/build_clippings.py --rebuild
```

The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.

`build_clippings.py` handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing; those without extractable text are skipped and written to `ocr_needed.txt` for later OCR.

### Search journals

#### Semantic search with LLM synthesis

**Requires Ollama running with `command-r7b`.**

**Hybrid BM25 + vector** (`query_hybrid.py`): Retrieves the top 20 chunks by vector similarity and the top 20 by BM25 term frequency, merges and deduplicates them, re-ranks the union to the top 15, then synthesizes an answer. Catches exact name/term matches that vector-only retrieval misses.

```bash
python query_hybrid.py "What does the author say about creativity?"
```

**Interactive wrapper** (`run_query.sh`): Loops for queries, displays timing, and appends queries to `query.log`.

```bash
./run_query.sh
```

#### Verbatim chunk retrieval (no LLM)

Same hybrid retrieval and re-ranking pipeline but outputs raw chunk text. Each chunk is annotated with its source: `[vector-only]`, `[bm25-only]`, or `[vector+bm25]`. **No Ollama needed.**

```bash
python retrieve.py "Kondiaronk and the Wendats"
```

#### Keyword search (no vector store, no LLM)

Extracts nouns and adjectives from the query using NLTK POS tagging, then greps journal files for matches with surrounding context.

```bash
python search_keywords.py "Discussions of Kondiaronk and the Wendats"
```

### Search clippings

Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then full chunk text. Includes page numbers for PDF sources. **No Ollama needed.**

```bash
python clippings_search/retrieve_clippings.py "creativity and innovation"
```

### Output format

```
Response:

Source documents:
2024-03-15.txt  ./data/2024-03-15.txt  0.683
2023-11-02.txt  ./data/2023-11-02.txt  0.651
...
```

## Configuration

Key parameters (set in source files):

| Parameter | Value | Location |
|-----------|-------|----------|
| Embedding model | `BAAI/bge-large-en-v1.5` | all build and query scripts |
| Chunk size | 256 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Chunk overlap | 25 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Paragraph separator | `\n\n` | `build_store.py` |
| Initial retrieval | 30 chunks | query and retrieve scripts |
| Re-rank model | `cross-encoder/ms-marco-MiniLM-L-12-v2` | query and retrieve scripts |
| Re-rank top-n | 15 | query and retrieve scripts |
| LLM | `command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API) | `query_hybrid.py` |
| Temperature | 0.3 | `query_hybrid.py` |
| Context window | 8000 tokens | `query_hybrid.py` |
| Request timeout | 360 seconds | `query_hybrid.py` |

## Key dependencies

- **llama-index-core** (0.14.14) -- RAG framework
- **llama-index-embeddings-huggingface** -- embedding integration
- **llama-index-vector-stores-chroma** -- ChromaDB vector store for clippings
- **llama-index-llms-ollama** -- local LLM via Ollama
- **llama-index-llms-openai** -- OpenAI API LLM (optional)
- **llama-index-retrievers-bm25** -- BM25 sparse retrieval for hybrid search
- **chromadb** -- persistent vector store for clippings index
- **sentence-transformers** -- cross-encoder re-ranking
- **torch** -- ML runtime

## Notebooks

Three Jupyter notebooks document exploration and analysis:

- **`hyde.ipynb`** -- Experiments with HyDE (Hypothetical Document Embeddings) query rewriting. Finding: did not improve retrieval quality over direct prompt engineering.
- **`sandbox.ipynb`** -- Exploratory notebook for learning the LlamaIndex API.
- **`vs_metrics.ipynb`** -- Quantitative analysis of the vector store (embedding distributions, pairwise similarity, clustering, PCA/t-SNE projections).

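
To make the chunk size and chunk overlap values from the Configuration table concrete, here is a rough sliding-window sketch. It is an illustration only: it treats a document as a flat list of tokens, whereas the build scripts rely on LlamaIndex, whose splitter may also respect paragraph and sentence boundaries. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# A made-up 600-token document.
tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [256, 256, 138]
```

With these settings each new chunk starts 231 tokens after the previous one, so the last 25 tokens of a chunk reappear at the start of the next.
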
## Design decisions

- **BAAI/bge-large-en-v1.5 over all-mpnet-base-v2**: Better semantic matching quality for journal text despite slower embedding.
- **256-token chunks**: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
- **command-r7b over llama3.1:8B**: Sticks closer to provided context with less hallucination at comparable speed.
- **Cross-encoder re-ranking**: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; `ms-marco-MiniLM-L-12-v2` selected over `stsb-roberta-base` (wrong task) and `BAAI/bge-reranker-v2-m3` (slower, weak score tail).
- **HyDE query rewriting tested and dropped**: Did not improve results over direct prompt engineering.
- **Hybrid BM25 + vector retrieval**: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
- **ChromaDB for clippings**: Persistent SQLite-backed store. Chosen over the JSON store for its metadata filtering and direct chunk-level operations for incremental updates.
- **PDF validation before indexing**: Pre-check each PDF with pypdf and skip it if text extraction yields <100 chars or a low printable ratio. Skipped files are written to `ocr_needed.txt`.

## Development history

- **Aug 2025**: Initial implementation -- build pipeline, embedding model comparison, chunk size experiments, HyDE testing.
- **Jan 2026**: Command-line interface, prompt improvements, model comparison (command-r7b selected).
- **Feb 2026**: Cross-encoder re-ranking, hybrid BM25+vector retrieval, LlamaIndex upgrade to 0.14.14, OpenAI API backend, incremental updates, clippings search (ChromaDB), project reorganization.

See `devlog.md` for detailed development notes and experimental findings.

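
As a closing illustration of the hybrid retrieval design decision listed above, the merge-and-deduplicate step can be sketched as follows. This is an assumption about the general shape of such a step, not the code in `query_hybrid.py` or `retrieve.py`; the function name `merge_candidates` and the chunk ids are hypothetical. Only the three source labels are taken from this README.

```python
def merge_candidates(vector_hits, bm25_hits):
    """Union two ranked lists of chunk ids and record which retriever found each.

    Order follows first appearance: vector hits first, then BM25-only hits.
    """
    vector_set, bm25_set = set(vector_hits), set(bm25_hits)
    merged, seen = [], set()
    for chunk_id in list(vector_hits) + list(bm25_hits):
        if chunk_id in seen:
            continue  # already added from the other retriever's list
        seen.add(chunk_id)
        if chunk_id in vector_set and chunk_id in bm25_set:
            label = "vector+bm25"
        elif chunk_id in vector_set:
            label = "vector-only"
        else:
            label = "bm25-only"
        merged.append((chunk_id, label))
    return merged


# Made-up chunk ids: "c3" is found by both retrievers.
merged = merge_candidates(["c1", "c2", "c3"], ["c3", "c4"])
for chunk_id, label in merged:
    print(f"[{label}] {chunk_id}")
```

The merged list would then go to the cross-encoder, which decides final relevance regardless of which retriever nominated a chunk.
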