ssearch
Semantic search over a personal journal archive and a collection of clippings. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025, plus a library of PDFs, articles, and web saves.
How it works
Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
- Build: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
- Retrieve: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
- Re-rank: A cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) scores each (query, chunk) pair jointly and keeps the top 15.
- Synthesize: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
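The retrieve and re-rank stages above form a two-stage funnel, which can be sketched in plain Python. This is a minimal illustration, not the project's code: `retrieve_then_rerank` and `word_overlap` are hypothetical names, the two-dimensional vectors are made up, and the toy word-overlap scorer only stands in for the cross-encoder.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    query_text: str,
    chunks: list[tuple[str, Sequence[float]]],  # (chunk text, chunk embedding)
    rerank_score: Callable[[str, str], float],  # stand-in for the cross-encoder
    retrieve_k: int = 30,
    rerank_k: int = 15,
) -> list[str]:
    # Stage 1: cheap vector similarity over every stored chunk.
    by_similarity = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    candidates = [text for text, _ in by_similarity[:retrieve_k]]
    # Stage 2: the more expensive pairwise scorer only sees the shortlist.
    ranked = sorted(candidates, key=lambda t: rerank_score(query_text, t), reverse=True)
    return ranked[:rerank_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy scorer (assumption): fraction of query words present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


chunks = [
    ("walked by the river", [1.0, 0.0]),
    ("notes on creativity and play", [0.6, 0.8]),
    ("creativity needs constraints", [0.0, 1.0]),
]
top = retrieve_then_rerank(
    [0.1, 0.9], "creativity constraints", chunks, word_overlap, retrieve_k=2, rerank_k=1
)
print(top)
```

The point of the split is cost: the first stage compares precomputed vectors, so it can scan the whole store, while the pairwise scorer runs only on the shortlist.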
Project structure
ssearch/
├── build_store.py # Build/update journal vector store (incremental)
├── query_hybrid.py # Hybrid BM25+vector query with LLM synthesis
├── retrieve.py # Verbatim hybrid retrieval (no LLM)
├── search_keywords.py # Keyword search via POS-based term extraction
├── run_query.sh # Interactive shell wrapper with timing and logging
├── clippings_search/
│ ├── build_clippings.py # Build/update clippings vector store (ChromaDB)
│ └── retrieve_clippings.py # Verbatim clippings chunk retrieval
├── data/ # Symlink to journal .txt files
├── clippings/ # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── store/ # Persisted journal vector store (~242 MB)
├── storage_clippings/ # Persisted clippings vector store (ChromaDB)
├── models/ # Cached HuggingFace models (offline)
├── archived/ # Superseded script versions
├── saved_output/ # Saved query results and model comparisons
├── requirements.txt # Python dependencies
├── devlog.txt # Development log and experimental findings
└── *.ipynb # Jupyter notebooks (HyDE, metrics, sandbox)
Setup
Prerequisites: Python 3.12, Ollama with command-r7b pulled.
cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The data/ symlink should point to the journal archive (plain .txt files). The clippings/ symlink should point to the clippings folder. The embedding model (BAAI/bge-large-en-v1.5) and cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) are cached in ./models/ for offline use.
Offline model loading
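All query scripts set three environment variables before importing any HuggingFace library. TOKENIZERS_PARALLELISM silences the tokenizers parallelism warning, SENTENCE_TRANSFORMERS_HOME points the model cache at ./models, and HF_HUB_OFFLINE prevents HuggingFace from making network requests: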
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
These must appear before any imports that touch HuggingFace libraries. The huggingface_hub library evaluates HF_HUB_OFFLINE once at import time (in huggingface_hub/constants.py). If the env var is set after imports, the library will still attempt network access and fail offline.
Alternatively, set the variable in your shell before running Python:
export HF_HUB_OFFLINE=1
python query_hybrid.py "your query"
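Because the ordering requirement is easy to break during refactoring, a small guard can make the failure explicit. The following is a sketch only: `configure_offline` and the `HF_MODULES` list are hypothetical and not part of the project's scripts.

```python
import os
import sys

# Assumption: these are the packages whose import order matters here.
HF_MODULES = ("huggingface_hub", "transformers", "sentence_transformers")


def configure_offline(models_dir: str = "./models") -> None:
    """Set the offline environment, refusing to run if it is already too late."""
    already = [m for m in HF_MODULES if m in sys.modules]
    if already:
        raise RuntimeError(f"imported before offline setup: {', '.join(already)}")
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    os.environ["SENTENCE_TRANSFORMERS_HOME"] = models_dir
    os.environ["HF_HUB_OFFLINE"] = "1"


configure_offline()
# HuggingFace-dependent imports would go below this line.
```

Checking `sys.modules` catches the case where another module has already pulled in huggingface_hub, at which point setting the variable would have no effect.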
Usage
Build the vector stores
# Journal index -- incremental update (default)
python build_store.py
# Journal index -- full rebuild
python build_store.py --rebuild
# Clippings index -- incremental update (default)
python clippings_search/build_clippings.py
# Clippings index -- full rebuild
python clippings_search/build_clippings.py --rebuild
The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.
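The size-and-date comparison can be illustrated with a standalone sketch. It assumes a manifest that maps file names to (size, modification time); the functions `snapshot` and `diff_snapshots` are hypothetical and do not reflect how build_store.py actually stores this information.

```python
import tempfile
from pathlib import Path


def snapshot(folder: Path) -> dict[str, tuple[int, int]]:
    """Map each .txt file name to (size in bytes, mtime in nanoseconds)."""
    result = {}
    for path in sorted(folder.glob("*.txt")):
        stat = path.stat()
        result[path.name] = (stat.st_size, stat.st_mtime_ns)
    return result


def diff_snapshots(old: dict, new: dict) -> tuple[list[str], list[str], list[str]]:
    """Return (added, changed, removed) file names between two snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(n for n in new.keys() & old.keys() if new[n] != old[n])
    return added, changed, removed


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "2024-03-15.txt").write_text("first entry")
    (folder / "2023-11-02.txt").write_text("second entry")
    before = snapshot(folder)

    (folder / "2024-03-15.txt").write_text("first entry, edited")  # size changes
    (folder / "2025-01-01.txt").write_text("new entry")
    (folder / "2023-11-02.txt").unlink()
    result = diff_snapshots(before, snapshot(folder))

print(result)
```

Only the added and changed files would need re-embedding, and chunks belonging to removed files would be dropped from the index.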
build_clippings.py handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing: those without extractable text are skipped and listed in ocr_needed.txt for later OCR.
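A check of this kind might look like the sketch below, applied to text that has already been extracted from a PDF. The 100-character minimum comes from the design notes in this README. The 0.8 printable ratio and the function name `has_extractable_text` are assumptions for illustration.

```python
def has_extractable_text(
    text: str, min_chars: int = 100, min_printable_ratio: float = 0.8
) -> bool:
    """Return True if extracted text looks like real prose rather than scan debris."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Newlines and tabs are not "printable" in Python's sense, so allow them explicitly.
    printable = sum(1 for ch in stripped if ch.isprintable() or ch in "\n\t")
    return printable / len(stripped) >= min_printable_ratio


print(has_extractable_text("word " * 40))       # ordinary text
print(has_extractable_text("\x00\x01" * 100))   # control-character debris
```

A PDF failing this check would be a candidate for the OCR list instead of the index.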
Search journals
Semantic search with LLM synthesis
Requires Ollama running with command-r7b.
Hybrid BM25 + vector (query_hybrid.py): Retrieves the top 20 chunks by vector similarity and the top 20 by BM25 term frequency, merges and deduplicates the two lists, re-ranks the union down to the top 15, and synthesizes an answer from those. Catches exact name/term matches that vector-only retrieval misses.
python query_hybrid.py "What does the author say about creativity?"
Interactive wrapper (run_query.sh): Prompts for queries in a loop, displays timing, and appends each query to query.log.
./run_query.sh
Verbatim chunk retrieval (no LLM)
Same hybrid retrieval and re-ranking pipeline but outputs raw chunk text. Each chunk is annotated with its source: [vector-only], [bm25-only], or [vector+bm25]. No Ollama needed.
python retrieve.py "Kondiaronk and the Wendats"
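The merge, deduplicate, and label step described above can be sketched as follows. The function `merge_candidates` and the chunk ids are hypothetical. The labels match the annotations listed above, but the logic is an illustration and not taken from retrieve.py.

```python
def merge_candidates(vector_hits: list[str], bm25_hits: list[str]) -> list[tuple[str, str]]:
    """Union two ranked lists of chunk ids in first-seen order and label each origin."""
    vector_set, bm25_set = set(vector_hits), set(bm25_hits)
    merged = []
    seen = set()
    for chunk_id in vector_hits + bm25_hits:
        if chunk_id in seen:
            continue  # already added from the other list
        seen.add(chunk_id)
        if chunk_id in vector_set and chunk_id in bm25_set:
            label = "vector+bm25"
        elif chunk_id in vector_set:
            label = "vector-only"
        else:
            label = "bm25-only"
        merged.append((chunk_id, label))
    return merged


merged = merge_candidates(["c1", "c2", "c3"], ["c3", "c4"])
for chunk_id, label in merged:
    print(f"[{label}] {chunk_id}")
```

The order of the merged list matters little in practice, since the cross-encoder re-scores every candidate afterwards.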
Keyword search (no vector store, no LLM)
Extracts nouns and adjectives from the query using NLTK POS tagging, then greps journal files for matches with surrounding context.
python search_keywords.py "Discussions of Kondiaronk and the Wendats"
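The grep-with-context half of this search can be sketched without NLTK by assuming the terms have already been extracted. `grep_with_context` is a hypothetical helper, and the sample entry text is invented for the example.

```python
def grep_with_context(
    lines: list[str], terms: list[str], context: int = 1
) -> list[tuple[int, list[str]]]:
    """Return (1-based line number, surrounding lines) for each line containing any term."""
    lowered = [t.lower() for t in terms]
    hits = []
    for i, line in enumerate(lines):
        if any(t in line.lower() for t in lowered):
            start, end = max(0, i - context), min(len(lines), i + context + 1)
            hits.append((i + 1, lines[start:end]))
    return hits


entry = [
    "Woke early and read for an hour.",
    "The chapter on Kondiaronk was the best part.",
    "Afternoon walk, then dinner.",
]
hits = grep_with_context(entry, ["Kondiaronk", "Wendats"])
for line_no, window in hits:
    print(line_no, window)
```

Matching here is case-insensitive substring matching, which is an assumption. The real script may match whole words or use a different context size.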
Search clippings
Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then full chunk text. Includes page numbers for PDF sources. No Ollama needed.
python clippings_search/retrieve_clippings.py "creativity and innovation"
Output format
Response:
<LLM-synthesized answer citing specific files>
Source documents:
2024-03-15.txt ./data/2024-03-15.txt 0.683
2023-11-02.txt ./data/2023-11-02.txt 0.651
...
Configuration
Key parameters (set in source files):
| Parameter | Value | Location |
|---|---|---|
| Embedding model | BAAI/bge-large-en-v1.5 | all build and query scripts |
| Chunk size | 256 tokens | build_store.py, clippings_search/build_clippings.py |
| Chunk overlap | 25 tokens | build_store.py, clippings_search/build_clippings.py |
| Paragraph separator | \n\n | build_store.py |
| Initial retrieval | 30 chunks | query and retrieve scripts |
| Re-rank model | cross-encoder/ms-marco-MiniLM-L-12-v2 | query and retrieve scripts |
| Re-rank top-n | 15 | query and retrieve scripts |
| LLM | command-r7b (Ollama) or gpt-4o-mini (OpenAI API) | query_hybrid.py |
| Temperature | 0.3 | query_hybrid.py |
| Context window | 8000 tokens | query_hybrid.py |
| Request timeout | 360 seconds | query_hybrid.py |
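To make the chunk size and overlap settings concrete, here is a sliding-window sketch over a list of tokens. It is a simplification: `chunk_tokens` is a hypothetical function, and it ignores sentence and paragraph boundaries, which the real build presumably honors given the paragraph separator setting.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 256, overlap: int = 25
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With these settings each window starts 231 tokens after the previous one, so a sentence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.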
Key dependencies
- llama-index-core (0.14.14) -- RAG framework
- llama-index-embeddings-huggingface -- embedding integration
- llama-index-vector-stores-chroma -- ChromaDB vector store for clippings
- llama-index-llms-ollama -- local LLM via Ollama
- llama-index-llms-openai -- OpenAI API LLM (optional)
- llama-index-retrievers-bm25 -- BM25 sparse retrieval for hybrid search
- chromadb -- persistent vector store for clippings index
- sentence-transformers -- cross-encoder re-ranking
- torch -- ML runtime
Notebooks
Three Jupyter notebooks document exploration and analysis:
- hyde.ipynb -- Experiments with HyDE (Hypothetical Document Embeddings) query rewriting. Finding: did not improve retrieval quality over direct prompt engineering.
- sandbox.ipynb -- Exploratory notebook for learning the LlamaIndex API.
- vs_metrics.ipynb -- Quantitative analysis of the vector store (embedding distributions, pairwise similarity, clustering, PCA/t-SNE projections).
Design decisions
- BAAI/bge-large-en-v1.5 over all-mpnet-base-v2: Better semantic matching quality for journal text despite slower embedding.
- 256-token chunks: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
- command-r7b over llama3.1:8B: Sticks closer to provided context with less hallucination at comparable speed.
- Cross-encoder re-ranking: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; ms-marco-MiniLM-L-12-v2 selected over stsb-roberta-base (wrong task) and BAAI/bge-reranker-v2-m3 (slower, weak score tail).
- HyDE query rewriting tested and dropped: Did not improve results over direct prompt engineering.
- Hybrid BM25 + vector retrieval: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
- ChromaDB for clippings: Persistent SQLite-backed store. Chosen over the JSON store for its metadata filtering and direct chunk-level operations for incremental updates.
- PDF validation before indexing: Pre-check each PDF with pypdf and skip it if text extraction yields <100 chars or a low printable ratio. Skipped files are written to ocr_needed.txt.
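To show why BM25 surfaces exact term matches, here is a compact Okapi BM25 scorer in plain Python. It is a teaching sketch, not the llama-index-retrievers-bm25 implementation the project uses: the whitespace tokenizer, the `k1` and `b` defaults, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document for the query, using whitespace tokens."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "a long talk about leadership and diplomacy",
    "kondiaronk spoke for the wendats at montreal",
    "weather turned cold today",
]
scores = bm25_scores("Kondiaronk Wendats", docs)
print(scores)
```

A document scores above zero only if it contains a query term, so a rare proper name pulls its chunk into the candidate set even when the embedding places it far from the query.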
Development history
- Aug 2025: Initial implementation -- build pipeline, embedding model comparison, chunk size experiments, HyDE testing.
- Jan 2026: Command-line interface, prompt improvements, model comparison (command-r7b selected).
- Feb 2026: Cross-encoder re-ranking, hybrid BM25+vector retrieval, LlamaIndex upgrade to 0.14.14, OpenAI API backend, incremental updates, clippings search (ChromaDB), project reorganization.
See devlog.txt for detailed development notes and experimental findings.