ssearch
Semantic search over a journal archive and a collection of clippings (articles, PDFs, web saves). Uses vector embeddings and a local LLM to find and synthesize information across dated journal entries and clipping files.
How it works
Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
- Build: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
- Retrieve: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
- Re-rank: A cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) scores each (query, chunk) pair jointly and keeps the top 15.
- Synthesize: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
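The build-time chunking and retrieval-time similarity steps can be sketched in plain Python. This is illustrative only: whitespace tokens stand in for the real tokenizer, and the function names are hypothetical, not taken from the ssearch scripts.

```python
import math

def chunk_tokens(tokens, size=256, overlap=25):
    """Slide a window of `size` tokens, stepping size - overlap each time."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

def top_k_cosine(query_vec, chunk_vecs, k=30):
    """Rank stored chunk vectors by cosine similarity to the query vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    ranked = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cos(query_vec, iv[1]), reverse=True)
    return [i for i, _ in ranked[:k]]
```

In the real pipeline the embedding model (not shown here) maps both chunks and queries into the same vector space; cosine ranking then selects the 30 candidates handed to the cross-encoder.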
Project structure
ssearch/
├── build_exp_claude.py # Build/update journal vector store (incremental)
├── build_clippings.py # Build/update clippings vector store (ChromaDB)
├── query_topk_prompt_engine_v3.py # Main query engine (cross-encoder re-ranking)
├── query_hybrid_bm25_v4.py # Hybrid BM25 + vector query
├── retrieve_raw.py # Verbatim journal chunk retrieval (no LLM)
├── retrieve_hybrid_raw.py # Hybrid verbatim retrieval (no LLM)
├── retrieve_clippings.py # Verbatim clippings chunk retrieval (no LLM)
├── search_keywords.py # Keyword search via POS-based term extraction
├── run_query.sh # Shell wrapper with timing and logging
├── data/ # Symlink to journal .txt files
├── clippings/ # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── storage_exp/ # Persisted journal vector store (~242 MB)
├── storage_clippings/ # Persisted clippings vector store (ChromaDB)
├── models/ # Cached HuggingFace models (offline)
└── requirements.txt # Python dependencies
Setup
Prerequisites: Python 3.12, Ollama with command-r7b pulled.
cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The data/ symlink should point to the journal archive (plain .txt files). The clippings/ symlink should point to the clippings folder (PDFs, TXT, webarchive, RTF). The embedding model (BAAI/bge-large-en-v1.5) and cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) are cached in ./models/ for offline use.
Offline model loading
All query scripts set three environment variables to prevent HuggingFace from making network requests:
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
These must appear before any imports that touch HuggingFace libraries. The huggingface_hub library evaluates HF_HUB_OFFLINE once at import time (in huggingface_hub/constants.py). If the env var is set after imports, the library will still attempt network access and fail offline.
Usage
Build the vector stores
# Journal index -- incremental update (default)
python build_exp_claude.py
# Journal index -- full rebuild
python build_exp_claude.py --rebuild
# Clippings index -- incremental update (default)
python build_clippings.py
# Clippings index -- full rebuild
python build_clippings.py --rebuild
The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.
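The change detection described above can be sketched as a manifest of (size, mtime) pairs. This is a hedged illustration, not the actual build_exp_claude.py internals; the manifest filename and function name are assumptions.

```python
import json
import os

def changed_files(source_dir, manifest_path):
    """Return (stale, current): files whose size or mtime differs from the manifest."""
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = {}                      # first run: everything is stale
    stale = []
    current = {}
    for name in sorted(os.listdir(source_dir)):
        if not name.endswith(".txt"):      # journal sources are plain .txt
            continue
        st = os.stat(os.path.join(source_dir, name))
        current[name] = [st.st_size, st.st_mtime]
        if manifest.get(name) != current[name]:
            stale.append(name)
    return stale, current
```

After re-indexing the stale files, the builder would persist `current` as the new manifest so the next run starts from it.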
build_clippings.py handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing -- those without extractable text (scanned, encrypted) are skipped and written to ocr_needed.txt for later OCR processing.
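The validation heuristic can be sketched as a pure function over the extracted text. The <100-char threshold comes from the description above; the printable-ratio cutoff here is an assumed value, and the function name is illustrative.

```python
def needs_ocr(extracted_text, min_chars=100, min_printable_ratio=0.8):
    """Heuristic: True when extraction looks like a scan or encrypted junk."""
    if len(extracted_text) < min_chars:
        return True                        # too little text to be a real extraction
    printable = sum(1 for c in extracted_text if c.isprintable() or c.isspace())
    return printable / len(extracted_text) < min_printable_ratio
```

In the build script this check would run on pypdf's extracted text per file; files that fail are logged to ocr_needed.txt instead of being indexed.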
Search journals
Three categories of search are available, from heaviest (semantic + LLM) to lightest (grep).
Semantic search with LLM synthesis
Requires Ollama running with command-r7b.
Vector-only (query_topk_prompt_engine_v3.py): Retrieves the top 30 chunks by cosine similarity, re-ranks to top 15, synthesizes.
python query_topk_prompt_engine_v3.py "What does the author say about creativity?"
Hybrid BM25 + vector (query_hybrid_bm25_v4.py): Retrieves top 20 by vector similarity and top 20 by BM25 term frequency, merges and deduplicates, re-ranks the union to top 15, synthesizes. Catches exact name/term matches that vector-only retrieval misses.
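The merge-and-deduplicate step can be sketched as a tagged union of the two candidate lists, using the same provenance labels that retrieve_hybrid_raw.py prints. The ordering here (vector candidates first) is an assumption; in the actual pipeline the cross-encoder re-ranks the union anyway, so merge order does not decide final relevance.

```python
def merge_candidates(vector_ids, bm25_ids):
    """Union of both candidate lists, tagged with each chunk's provenance."""
    bm25_set = set(bm25_ids)
    vector_set = set(vector_ids)
    merged = []
    for cid in vector_ids:                 # vector candidates, in rank order
        tag = "[vector+bm25]" if cid in bm25_set else "[vector-only]"
        merged.append((cid, tag))
    for cid in bm25_ids:                   # BM25-only candidates appended after
        if cid not in vector_set:
            merged.append((cid, "[bm25-only]"))
    return merged
```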
python query_hybrid_bm25_v4.py "Louis Menand"
Interactive wrapper (run_query.sh): Prompts for queries in a loop using the v3 engine, displays timing, and appends queries to query.log.
./run_query.sh
Verbatim chunk retrieval (no LLM)
Same retrieval and re-ranking pipeline but outputs raw chunk text. No Ollama needed.
Vector-only (retrieve_raw.py):
python retrieve_raw.py "Kondiaronk and the Wendats"
Hybrid BM25 + vector (retrieve_hybrid_raw.py): Each chunk is annotated with its source: [vector-only], [bm25-only], or [vector+bm25].
python retrieve_hybrid_raw.py "Louis Menand"
Keyword search (no vector store, no LLM)
search_keywords.py: Extracts nouns and adjectives from the query using NLTK POS tagging, then greps the journal files for matches with surrounding context.
python search_keywords.py "Discussions of Kondiaronk and the Wendats"
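The grep-with-context half of that pipeline can be sketched without the tagger. This simplified version takes already-extracted terms as input (the real script derives them with NLTK POS tagging) and returns matching lines with surrounding context; the function name and context size are illustrative.

```python
def grep_with_context(lines, terms, context=1):
    """Return (line_number, snippet) pairs for lines containing any term."""
    terms_lower = [t.lower() for t in terms]
    hits = []
    for i, line in enumerate(lines):
        if any(t in line.lower() for t in terms_lower):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            hits.append((i + 1, lines[lo:hi]))   # 1-based line number
    return hits
```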
Search clippings
retrieve_clippings.py: Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then the full chunk text. No Ollama needed.
python retrieve_clippings.py "creativity and innovation"
Output includes page numbers for PDF sources when available.
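The source summary can be sketched as one line per ranked chunk, with the page shown only when the chunk's metadata carries one. The metadata keys (`file`, `page`) and the output format here are assumptions, not the script's exact layout.

```python
def format_sources(chunks):
    """One summary line per ranked chunk: rank, source file, page if present."""
    lines = []
    for rank, chunk in enumerate(chunks, start=1):
        page = chunk.get("page")           # only PDFs tend to carry a page number
        suffix = f" (p. {page})" if page is not None else ""
        lines.append(f"{rank}. {chunk['file']}{suffix}")
    return lines
```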
Configuration
Key parameters (set in source files):
| Parameter | Value | Location |
|---|---|---|
| Embedding model | BAAI/bge-large-en-v1.5 | all build and query scripts |
| Chunk size | 256 tokens | build_exp_claude.py, build_clippings.py |
| Chunk overlap | 25 tokens | build_exp_claude.py, build_clippings.py |
| Initial retrieval | 30 chunks | query and retrieve scripts |
| Re-rank model | cross-encoder/ms-marco-MiniLM-L-12-v2 | query and retrieve scripts |
| Re-rank top-n | 15 | query and retrieve scripts |
| LLM | command-r7b (Ollama) or gpt-4o-mini (OpenAI API) | query scripts |
| Temperature | 0.3 | query scripts |
| Context window | 8000 tokens | query_topk_prompt_engine_v3.py |
Key dependencies
- llama-index-core (0.14.14) -- RAG framework
- llama-index-embeddings-huggingface -- embedding integration
- llama-index-vector-stores-chroma -- ChromaDB vector store for clippings
- llama-index-llms-ollama -- local LLM via Ollama
- llama-index-llms-openai -- OpenAI API LLM (optional)
- llama-index-retrievers-bm25 -- BM25 sparse retrieval for hybrid search
- chromadb -- persistent vector store for clippings index
- sentence-transformers -- cross-encoder re-ranking
- torch -- ML runtime
Design decisions
- BAAI/bge-large-en-v1.5 over all-mpnet-base-v2: Better semantic matching quality for journal text despite slower embedding.
- 256-token chunks: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
- command-r7b over llama3.1:8B: Sticks closer to provided context with less hallucination at comparable speed.
- Cross-encoder re-ranking: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; ms-marco-MiniLM-L-12-v2 selected over stsb-roberta-base (wrong task) and BAAI/bge-reranker-v2-m3 (slower, weak score tail).
- Hybrid BM25 + vector retrieval: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
- ChromaDB for clippings: Persistent SQLite-backed store. Chosen over the JSON store used for journals because the clippings index handles more diverse file types and benefits from ChromaDB's metadata filtering and direct chunk-level operations for incremental updates.
- PDF validation before indexing: Pre-check each PDF with pypdf -- skip if text extraction yields <100 chars or low printable ratio. Skipped files are written to ocr_needed.txt for later OCR processing.