
ssearch

Semantic search over a journal archive and a collection of clippings (articles, PDFs, web saves). Uses vector embeddings and a local LLM to find and synthesize information across dated journal entries and the clippings library.

How it works

Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
  1. Build: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
  2. Retrieve: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
  3. Re-rank: A cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) scores each (query, chunk) pair jointly and keeps the top 15.
  4. Synthesize: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
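The chunking in step 1 can be sketched in plain Python (a simplified illustration: the real build scripts chunk on model tokens via LlamaIndex, while whitespace tokens stand in here):

```python
def chunk_tokens(tokens, size=256, overlap=25):
    """Split a token list into windows of `size` tokens,
    each window starting `size - overlap` tokens after the last."""
    stride = size - overlap  # 231 tokens of new material per chunk
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

chunks = chunk_tokens(("word " * 600).split())
# Consecutive chunks share 25 tokens, so a sentence cut at a chunk
# boundary still appears whole in at least one chunk.
```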

Project structure

ssearch/
├── build_exp_claude.py             # Build/update journal vector store (incremental)
├── build_clippings.py              # Build/update clippings vector store (ChromaDB)
├── query_topk_prompt_engine_v3.py  # Main query engine (cross-encoder re-ranking)
├── query_hybrid_bm25_v4.py         # Hybrid BM25 + vector query
├── retrieve_raw.py                 # Verbatim journal chunk retrieval (no LLM)
├── retrieve_hybrid_raw.py          # Hybrid verbatim retrieval (no LLM)
├── retrieve_clippings.py           # Verbatim clippings chunk retrieval (no LLM)
├── search_keywords.py              # Keyword search via POS-based term extraction
├── run_query.sh                    # Shell wrapper with timing and logging
├── data/                           # Symlink to journal .txt files
├── clippings/                      # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── storage_exp/                    # Persisted journal vector store (~242 MB)
├── storage_clippings/              # Persisted clippings vector store (ChromaDB)
├── models/                         # Cached HuggingFace models (offline)
└── requirements.txt                # Python dependencies

Setup

Prerequisites: Python 3.12, Ollama with command-r7b pulled.

cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The data/ symlink should point to the journal archive (plain .txt files). The clippings/ symlink should point to the clippings folder (PDFs, TXT, webarchive, RTF). The embedding model (BAAI/bge-large-en-v1.5) and cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2) are cached in ./models/ for offline use.

Offline model loading

All query scripts set three environment variables to prevent HuggingFace from making network requests:

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

These must appear before any imports that touch HuggingFace libraries. The huggingface_hub library evaluates HF_HUB_OFFLINE once at import time (in huggingface_hub/constants.py). If the env var is set after imports, the library will still attempt network access and fail offline.
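A sketch of the required ordering at the top of a query script (the commented import is illustrative):

```python
import os

# Must run before any import that touches HuggingFace libraries:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

# Only now is it safe to import libraries that read these at import time:
# from sentence_transformers import CrossEncoder
```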

Usage

Build the vector stores

# Journal index -- incremental update (default)
python build_exp_claude.py

# Journal index -- full rebuild
python build_exp_claude.py --rebuild

# Clippings index -- incremental update (default)
python build_clippings.py

# Clippings index -- full rebuild
python build_clippings.py --rebuild

The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.
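The change detection can be sketched as follows (a simplified illustration of the size-plus-mtime comparison; the real scripts track this per indexed document inside the vector store):

```python
import os

def changed_files(paths, manifest):
    """Return paths whose (size, mtime) differ from the stored manifest.

    `manifest` maps path -> (size, mtime) recorded at the previous build;
    it is updated in place so the next call sees the new signatures.
    """
    changed = []
    for path in paths:
        st = os.stat(path)
        sig = (st.st_size, st.st_mtime)
        if manifest.get(path) != sig:
            changed.append(path)
            manifest[path] = sig
    return changed
```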

build_clippings.py handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing -- those without extractable text (scanned, encrypted) are skipped and written to ocr_needed.txt for later OCR processing.

Search journals

Three categories of search are available, from heaviest (semantic + LLM) to lightest (grep).

Semantic search with LLM synthesis

Requires Ollama running with command-r7b.

Vector-only (query_topk_prompt_engine_v3.py): Retrieves the top 30 chunks by cosine similarity, re-ranks to top 15, synthesizes.

python query_topk_prompt_engine_v3.py "What does the author say about creativity?"

Hybrid BM25 + vector (query_hybrid_bm25_v4.py): Retrieves top 20 by vector similarity and top 20 by BM25 term frequency, merges and deduplicates, re-ranks the union to top 15, synthesizes. Catches exact name/term matches that vector-only retrieval misses.

python query_hybrid_bm25_v4.py "Louis Menand"
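The merge-and-deduplicate step can be sketched as (pure-Python illustration over chunk IDs; the real candidates are LlamaIndex nodes):

```python
def merge_candidates(vector_hits, bm25_hits):
    """Union two ranked candidate lists, deduplicating by chunk ID and
    tagging each chunk with the retriever(s) that nominated it."""
    sources = {}
    for cid in vector_hits:
        sources[cid] = "vector-only"
    for cid in bm25_hits:
        sources[cid] = "vector+bm25" if cid in sources else "bm25-only"
    return sources

merged = merge_candidates(["a", "b", "c"], ["b", "d"])
# 'b' was nominated by both retrievers; the whole union then goes to
# the cross-encoder, which decides final relevance.
```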

Interactive wrapper (run_query.sh): Prompts for queries in a loop using the v3 engine, displays timing, and appends each query to query.log.

./run_query.sh

Verbatim chunk retrieval (no LLM)

Same retrieval and re-ranking pipeline but outputs raw chunk text. No Ollama needed.

Vector-only (retrieve_raw.py):

python retrieve_raw.py "Kondiaronk and the Wendats"

Hybrid BM25 + vector (retrieve_hybrid_raw.py): Each chunk is annotated with its source: [vector-only], [bm25-only], or [vector+bm25].

python retrieve_hybrid_raw.py "Louis Menand"

Keyword search (no vector store, no LLM)

search_keywords.py: Extracts nouns and adjectives from the query using NLTK POS tagging, then greps the journal files for matches with surrounding context.

python search_keywords.py "Discussions of Kondiaronk and the Wendats"
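The grep-with-context step can be sketched as (the hard-coded terms stand in for the NLTK noun/adjective extraction):

```python
def grep_context(lines, terms, context=1):
    """Return (line_number, snippet) pairs for lines containing any term,
    with `context` lines of surrounding text on either side."""
    hits = []
    for i, line in enumerate(lines):
        if any(t.lower() in line.lower() for t in terms):
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            hits.append((i + 1, "\n".join(lines[lo:hi])))
    return hits
```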

Search clippings

retrieve_clippings.py: Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then the full chunk text. No Ollama needed.

python retrieve_clippings.py "creativity and innovation"

Output includes page numbers for PDF sources when available.
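The page-number annotation can be illustrated with a small formatter over chunk metadata (a sketch; the field names `file_name` and `page_label` follow common LlamaIndex/pypdf conventions and are assumptions here):

```python
def format_source(meta):
    """Render a one-line source citation, adding the page number
    when the chunk's metadata carries one (PDF sources)."""
    name = meta.get("file_name", "unknown")
    page = meta.get("page_label")
    return f"{name} (p. {page})" if page else name

print(format_source({"file_name": "essay.pdf", "page_label": "12"}))
print(format_source({"file_name": "notes.txt"}))
```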

Configuration

Key parameters (set in source files):

| Parameter | Value | Location |
|---|---|---|
| Embedding model | BAAI/bge-large-en-v1.5 | all build and query scripts |
| Chunk size | 256 tokens | build_exp_claude.py, build_clippings.py |
| Chunk overlap | 25 tokens | build_exp_claude.py, build_clippings.py |
| Initial retrieval | 30 chunks | query and retrieve scripts |
| Re-rank model | cross-encoder/ms-marco-MiniLM-L-12-v2 | query and retrieve scripts |
| Re-rank top-n | 15 | query and retrieve scripts |
| LLM | command-r7b (Ollama) or gpt-4o-mini (OpenAI API) | query scripts |
| Temperature | 0.3 | query scripts |
| Context window | 8000 tokens | query_topk_prompt_engine_v3.py |

Key dependencies

  • llama-index-core (0.14.14) -- RAG framework
  • llama-index-embeddings-huggingface -- embedding integration
  • llama-index-vector-stores-chroma -- ChromaDB vector store for clippings
  • llama-index-llms-ollama -- local LLM via Ollama
  • llama-index-llms-openai -- OpenAI API LLM (optional)
  • llama-index-retrievers-bm25 -- BM25 sparse retrieval for hybrid search
  • chromadb -- persistent vector store for clippings index
  • sentence-transformers -- cross-encoder re-ranking
  • torch -- ML runtime

Design decisions

  • BAAI/bge-large-en-v1.5 over all-mpnet-base-v2: Better semantic matching quality for journal text despite slower embedding.
  • 256-token chunks: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
  • command-r7b over llama3.1:8B: Sticks closer to provided context with less hallucination at comparable speed.
  • Cross-encoder re-ranking: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; ms-marco-MiniLM-L-12-v2 selected over stsb-roberta-base (wrong task) and BAAI/bge-reranker-v2-m3 (slower, weak score tail).
  • Hybrid BM25 + vector retrieval: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
  • ChromaDB for clippings: Persistent SQLite-backed store. Chosen over the JSON store used for journals because the clippings index handles more diverse file types and benefits from ChromaDB's metadata filtering and direct chunk-level operations for incremental updates.
  • PDF validation before indexing: Pre-check each PDF with pypdf -- skip if text extraction yields <100 chars or low printable ratio. Skipped files are written to ocr_needed.txt for later OCR processing.
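The text-validation heuristic can be sketched as (pure Python; in the real script the text comes from pypdf page extraction, and the 0.8 printable ratio is an assumed threshold — only the 100-character minimum is stated above):

```python
def has_usable_text(text, min_chars=100, min_printable=0.8):
    """Reject extractions that are too short or dominated by
    non-printable characters (typical of scanned or encrypted PDFs)."""
    if len(text) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) >= min_printable
```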