Reorganize project: rename scripts, archive superseded, add clippings_search/

- Rename build_exp_claude.py → build_store.py
- Rename query_hybrid_bm25_v4.py → query_hybrid.py
- Rename retrieve_hybrid_raw.py → retrieve.py
- Archive query_topk_prompt_engine_v3.py (superseded by hybrid)
- Archive retrieve_raw.py (superseded by hybrid)
- Move build_clippings.py, retrieve_clippings.py → clippings_search/
- Update run_query.sh, README.md, CLAUDE.md for new names
2026-02-26 16:24:32 -05:00
commit 5a3294f74c
parent b4bf89ce4b
9 changed files with 80 additions and 87 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # ssearch

-Semantic search over a personal journal archive. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025.
+Semantic search over a personal journal archive and a collection of clippings. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025, plus a library of PDFs, articles, and web saves.

 ## How it works

@@ -8,7 +8,7 @@ Semantic search over a personal journal archive. Uses vector embeddings and a lo
 Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
 ```

-1. **Build**: Journal entries in `./data` are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. Supports incremental updates (new/modified files only) or full rebuilds.
+1. **Build**: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
 2. **Retrieve**: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
 3. **Re-rank**: A cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) scores each (query, chunk) pair jointly and keeps the top 15.
 4. **Synthesize**: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
@@ -17,23 +17,24 @@ Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cros

 ```
 ssearch/
-├── build_exp_claude.py             # Build/update vector store (incremental by default)
-├── query_topk_prompt_engine_v3.py  # Main query engine (cross-encoder re-ranking)
-├── query_topk_prompt_engine_v2.py  # Previous query engine (no re-ranking)
-├── retrieve_raw.py                 # Verbatim chunk retrieval (no LLM)
-├── query_hybrid_bm25_v4.py        # Hybrid BM25 + vector query (v4)
-├── retrieve_hybrid_raw.py          # Hybrid verbatim retrieval (no LLM)
-├── search_keywords.py              # Keyword search via POS-based term extraction
-├── run_query.sh                    # Shell wrapper with timing and logging
-├── data/                           # Symlink to ../text/ (journal .txt files)
-├── storage_exp/                    # Persisted vector store (~242 MB)
-├── models/                         # Cached HuggingFace models (embedding + cross-encoder, offline)
-├── archived/                       # Earlier iterations and prototypes
-├── saved_output/                   # Saved query results and model comparisons
-├── requirements.txt                # Python dependencies (pip freeze)
-├── NOTES.md                        # Similarity metric reference
-├── devlog.txt                      # Development log and experimental findings
-└── *.ipynb                         # Jupyter notebooks (HyDE, metrics, sandbox)
+├── build_store.py              # Build/update journal vector store (incremental)
+├── query_hybrid.py             # Hybrid BM25+vector query with LLM synthesis
+├── retrieve.py                 # Verbatim hybrid retrieval (no LLM)
+├── search_keywords.py          # Keyword search via POS-based term extraction
+├── run_query.sh                # Interactive shell wrapper with timing and logging
+├── clippings_search/
+│   ├── build_clippings.py      # Build/update clippings vector store (ChromaDB)
+│   └── retrieve_clippings.py   # Verbatim clippings chunk retrieval
+├── data/                       # Symlink to journal .txt files
+├── clippings/                  # Symlink to clippings (PDFs, TXT, webarchive, RTF)
+├── storage_exp/                # Persisted journal vector store (~242 MB)
+├── storage_clippings/          # Persisted clippings vector store (ChromaDB)
+├── models/                     # Cached HuggingFace models (offline)
+├── archived/                   # Superseded script versions
+├── saved_output/               # Saved query results and model comparisons
+├── requirements.txt            # Python dependencies
+├── devlog.txt                  # Development log and experimental findings
+└── *.ipynb                     # Jupyter notebooks (HyDE, metrics, sandbox)
 ```

 ## Setup
@@ -47,7 +48,7 @@ source .venv/bin/activate
 pip install -r requirements.txt
 ```

-The `data/` symlink should point to `../text/` (the journal archive). The embedding model (`BAAI/bge-large-en-v1.5`) and cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) are cached in `./models/` for offline use.
+The `data/` symlink should point to the journal archive (plain `.txt` files). The `clippings/` symlink should point to the clippings folder. The embedding model (`BAAI/bge-large-en-v1.5`) and cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) are cached in `./models/` for offline use.

 ### Offline model loading

@@ -59,74 +60,75 @@ os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
 os.environ["HF_HUB_OFFLINE"] = "1"
 ```

-**These must appear before any imports that touch HuggingFace libraries.** The `huggingface_hub` library evaluates `HF_HUB_OFFLINE` once at import time (in `huggingface_hub/constants.py`). If the env var is set after imports, the library will still attempt network access and fail offline. This is a common pitfall -- `llama_index.embeddings.huggingface` transitively imports `huggingface_hub`, so even indirect imports trigger the evaluation.
+**These must appear before any imports that touch HuggingFace libraries.** The `huggingface_hub` library evaluates `HF_HUB_OFFLINE` once at import time (in `huggingface_hub/constants.py`). If the env var is set after imports, the library will still attempt network access and fail offline.

 Alternatively, set the variable in your shell before running Python:
 ```bash
 export HF_HUB_OFFLINE=1
-python query_hybrid_bm25_v4.py "your query"
+python query_hybrid.py "your query"
 ```

 ## Usage

-### Build the vector store
+### Build the vector stores

 ```bash
-# Incremental update (default): only processes new, modified, or deleted files
-python build_exp_claude.py
+# Journal index -- incremental update (default)
+python build_store.py

-# Full rebuild from scratch
-python build_exp_claude.py --rebuild
+# Journal index -- full rebuild
+python build_store.py --rebuild
+
+# Clippings index -- incremental update (default)
+python clippings_search/build_clippings.py
+
+# Clippings index -- full rebuild
+python clippings_search/build_clippings.py --rebuild
 ```

-The default incremental mode loads the existing index, compares file sizes and modification dates against the docstore, and only re-indexes what changed. A full rebuild (`--rebuild`) is only needed when chunk parameters or the embedding model change.
+The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.

-### Search
+`build_clippings.py` handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing — those without extractable text are skipped and written to `ocr_needed.txt` for later OCR.

-Three categories of search are available, from heaviest (semantic + LLM) to lightest (grep).
+### Search journals

 #### Semantic search with LLM synthesis

-These scripts embed the query, retrieve candidate chunks from the vector store, re-rank with a cross-encoder, and pass the top results to a local LLM that synthesizes a grounded answer with file citations. **Requires Ollama running with `command-r7b`.**
+**Requires Ollama running with `command-r7b`.**

-**Vector-only** (`query_topk_prompt_engine_v3.py`): Retrieves the top 30 chunks by cosine similarity, re-ranks to top 15, synthesizes.
+**Hybrid BM25 + vector** (`query_hybrid.py`): Retrieves top 20 by vector similarity and top 20 by BM25 term frequency, merges and deduplicates, re-ranks the union to top 15, synthesizes. Catches exact name/term matches that vector-only retrieval misses.
 ```bash
-python query_topk_prompt_engine_v3.py "What does the author say about creativity?"
+python query_hybrid.py "What does the author say about creativity?"
 ```

-**Hybrid BM25 + vector** (`query_hybrid_bm25_v4.py`): Retrieves top 20 by vector similarity and top 20 by BM25 term frequency, merges and deduplicates, re-ranks the union to top 15, synthesizes. Catches exact name/term matches that vector-only retrieval misses.
-```bash
-python query_hybrid_bm25_v4.py "Louis Menand"
-```
-
-**Interactive wrapper** (`run_query.sh`): Loops for queries using the v3 engine, displays timing, and appends queries to `query.log`.
+**Interactive wrapper** (`run_query.sh`): Loops for queries, displays timing, and appends queries to `query.log`.
 ```bash
 ./run_query.sh
 ```

 #### Verbatim chunk retrieval (no LLM)

-These scripts run the same retrieval and re-ranking pipeline but output the raw chunk text instead of passing it to an LLM. Useful for inspecting what the retrieval pipeline finds, or when Ollama is not available. **No Ollama needed.**
+Same hybrid retrieval and re-ranking pipeline but outputs raw chunk text. Each chunk is annotated with its source: `[vector-only]`, `[bm25-only]`, or `[vector+bm25]`. **No Ollama needed.**

-**Vector-only** (`retrieve_raw.py`): Top-30 vector retrieval, cross-encoder re-rank to top 15, raw output.
 ```bash
-python retrieve_raw.py "Kondiaronk and the Wendats"
+python retrieve.py "Kondiaronk and the Wendats"
 ```

-**Hybrid BM25 + vector** (`retrieve_hybrid_raw.py`): Same hybrid retrieval as v4 but outputs raw chunks. Each chunk is annotated with its source: `[vector-only]`, `[bm25-only]`, or `[vector+bm25]`.
-```bash
-python retrieve_hybrid_raw.py "Louis Menand"
-```
-
-Pipe either to `less` for browsing.
-
 #### Keyword search (no vector store, no LLM)

-**`search_keywords.py`**: Extracts nouns and adjectives from the query using NLTK POS tagging, then greps `./data/*.txt` for matches with surrounding context. A lightweight fallback when you want exact string matching without the vector store. **No vector store or Ollama needed.**
+Extracts nouns and adjectives from the query using NLTK POS tagging, then greps journal files for matches with surrounding context.
 ```bash
 python search_keywords.py "Discussions of Kondiaronk and the Wendats"
 ```

+### Search clippings
+
+Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then full chunk text. Includes page numbers for PDF sources. **No Ollama needed.**
+
+```bash
+python clippings_search/retrieve_clippings.py "creativity and innovation"
+```
+
 ### Output format

 ```
@@ -145,64 +147,55 @@ Key parameters (set in source files):

 | Parameter | Value | Location |
 |-----------|-------|----------|
-| Embedding model | `BAAI/bge-large-en-v1.5` | `build_exp_claude.py`, `query_topk_prompt_engine_v3.py` |
-| Chunk size | 256 tokens | `build_exp_claude.py` |
-| Chunk overlap | 25 tokens | `build_exp_claude.py` |
-| Paragraph separator | `\n\n` | `build_exp_claude.py` |
-| Initial retrieval | 30 chunks | `query_topk_prompt_engine_v3.py` |
-| Re-rank model | `cross-encoder/ms-marco-MiniLM-L-12-v2` | `query_topk_prompt_engine_v3.py` |
-| Re-rank top-n | 15 | `query_topk_prompt_engine_v3.py` |
-| LLM | `command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API) | `query_topk_prompt_engine_v3.py`, `query_hybrid_bm25_v4.py` |
-| Temperature | 0.3 (recommended for both local and API models) | `query_topk_prompt_engine_v3.py`, `query_hybrid_bm25_v4.py` |
-| Context window | 8000 tokens | `query_topk_prompt_engine_v3.py` |
-| Request timeout | 360 seconds | `query_topk_prompt_engine_v3.py` |
+| Embedding model | `BAAI/bge-large-en-v1.5` | all build and query scripts |
+| Chunk size | 256 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
+| Chunk overlap | 25 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
+| Paragraph separator | `\n\n` | `build_store.py` |
+| Initial retrieval | 30 chunks | query and retrieve scripts |
+| Re-rank model | `cross-encoder/ms-marco-MiniLM-L-12-v2` | query and retrieve scripts |
+| Re-rank top-n | 15 | query and retrieve scripts |
+| LLM | `command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API) | `query_hybrid.py` |
+| Temperature | 0.3 | `query_hybrid.py` |
+| Context window | 8000 tokens | `query_hybrid.py` |
+| Request timeout | 360 seconds | `query_hybrid.py` |

 ## Key dependencies

 - **llama-index-core** (0.14.14) -- RAG framework
-- **llama-index-embeddings-huggingface** (0.6.1) -- embedding integration
-- **llama-index-llms-ollama** (0.9.1) -- local LLM via Ollama
-- **llama-index-llms-openai** (0.6.18) -- OpenAI API LLM (optional, for API-based synthesis)
-- **llama-index-readers-file** (0.5.6) -- file readers
-- **llama-index-retrievers-bm25** (0.6.5) -- BM25 sparse retrieval for hybrid search
-- **sentence-transformers** (5.1.0) -- embedding model support
-- **torch** (2.8.0) -- ML runtime
+- **llama-index-embeddings-huggingface** -- embedding integration
+- **llama-index-vector-stores-chroma** -- ChromaDB vector store for clippings
+- **llama-index-llms-ollama** -- local LLM via Ollama
+- **llama-index-llms-openai** -- OpenAI API LLM (optional)
+- **llama-index-retrievers-bm25** -- BM25 sparse retrieval for hybrid search
+- **chromadb** -- persistent vector store for clippings index
+- **sentence-transformers** -- cross-encoder re-ranking
+- **torch** -- ML runtime

 ## Notebooks

 Three Jupyter notebooks document exploration and analysis:

-- **`hyde.ipynb`** -- Experiments with HyDE (Hypothetical Document Embeddings) query rewriting. Tests whether generating a hypothetical answer to a query and embedding that instead improves retrieval. Uses LlamaIndex's `HyDEQueryTransform` with `llama3.1:8B`. Finding: the default HyDE prompt produced a rich hypothetical passage, but the technique did not improve retrieval quality over direct prompt engineering. This informed the decision to drop HyDE from the pipeline.
+- **`hyde.ipynb`** -- Experiments with HyDE (Hypothetical Document Embeddings) query rewriting. Finding: did not improve retrieval quality over direct prompt engineering.

-- **`sandbox.ipynb`** -- Exploratory notebook for learning the LlamaIndex API. Inspects the `llama_index.core` module (104 objects), lists available classes and methods, and reads the source of `VectorStoreIndex`. Useful as a quick reference for what LlamaIndex exposes.
+- **`sandbox.ipynb`** -- Exploratory notebook for learning the LlamaIndex API.

-- **`vs_metrics.ipynb`** -- Quantitative analysis of the vector store. Loads the persisted index (4,692 vectors, 1024 dimensions each from `BAAI/bge-large-en-v1.5`) and produces:
-  - Distribution of embedding values (histogram)
-  - Heatmap of the full embedding matrix
-  - Embedding vector magnitude distribution
-  - Per-dimension variance (which dimensions carry more signal)
-  - Pairwise cosine similarity distribution and heatmap (subset)
-  - Hierarchical clustering dendrogram (Ward linkage)
-  - PCA and t-SNE 2D projections of the embedding space
+- **`vs_metrics.ipynb`** -- Quantitative analysis of the vector store (embedding distributions, pairwise similarity, clustering, PCA/t-SNE projections).

 ## Design decisions

 - **BAAI/bge-large-en-v1.5 over all-mpnet-base-v2**: Better semantic matching quality for journal text despite slower embedding.
 - **256-token chunks**: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
 - **command-r7b over llama3.1:8B**: Sticks closer to provided context with less hallucination at comparable speed.
-- **Top-k=15**: Wide enough to capture diverse perspectives, narrow enough to fit the context window.
-- **Cross-encoder re-ranking (v3)**: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. More accurate than bi-encoder similarity alone. Tested three models; `ms-marco-MiniLM-L-12-v2` selected over `stsb-roberta-base` (wrong task -- semantic similarity, not passage ranking) and `BAAI/bge-reranker-v2-m3` (50% slower, weak score tail).
+- **Cross-encoder re-ranking**: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; `ms-marco-MiniLM-L-12-v2` selected over `stsb-roberta-base` (wrong task) and `BAAI/bge-reranker-v2-m3` (slower, weak score tail).
 - **HyDE query rewriting tested and dropped**: Did not improve results over direct prompt engineering.
-- **V3 prompt**: Adapted for re-ranked context -- tells the LLM all excerpts have been curated, encourages examining every chunk and noting what each file contributes. Produces better multi-source synthesis than v2's prompt.
-- **V2 prompt**: More flexible and query-adaptive than v1, which forced rigid structure (exactly 10 files, mandatory theme).
-- **Verbatim retrieval (`retrieve_raw.py`)**: Uses LlamaIndex's `index.as_retriever()` instead of `index.as_query_engine()`. The retriever returns raw `NodeWithScore` objects (chunk text, metadata, scores) without invoking the LLM. The re-ranker is applied manually via `reranker.postprocess_nodes()`. This separation lets you inspect what the pipeline retrieves before synthesis.
-- **Keyword search (`search_keywords.py`)**: NLTK POS tagging extracts nouns and adjectives from the query -- a middle ground between naive stopword removal and LLM-based term extraction. Catches exact names, places, and dates that vector similarity misses.
-- **Hybrid BM25 + vector retrieval (v4)**: Runs two retrievers in parallel -- BM25 (top-20 by term frequency) and vector similarity (top-20 by cosine) -- merges and deduplicates candidates, then lets the cross-encoder re-rank the union to top-15. BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance. Uses `BM25Retriever.from_defaults(index=index)` from `llama-index-retrievers-bm25`, which indexes the nodes already stored in the persisted vector store.
+- **Hybrid BM25 + vector retrieval**: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
+- **ChromaDB for clippings**: Persistent SQLite-backed store. Chosen over the JSON store for its metadata filtering and direct chunk-level operations for incremental updates.
+- **PDF validation before indexing**: Pre-check each PDF with pypdf — skip if text extraction yields <100 chars or low printable ratio. Skipped files written to `ocr_needed.txt`.

 ## Development history

-- **Aug 2025**: Initial implementation -- build pipeline, embedding model comparison, chunk size experiments, HyDE testing, prompt v1.
-- **Jan 2026**: Command-line interface, v2 prompt, error handling improvements, model comparison (command-r7b selected).
-- **Feb 2026**: Project tidy-up, cross-encoder re-ranking (v3), v3 prompt for multi-source synthesis, cross-encoder model comparison (L-12 selected), archived superseded scripts. Hybrid BM25 + vector retrieval (v4). Upgraded LlamaIndex from 0.13.1 to 0.14.14; added OpenAI API as optional LLM backend (`llama-index-llms-openai`). Incremental vector store updates (default mode in `build_exp_claude.py`). Fixed offline HuggingFace model loading (env vars must precede imports).
+- **Aug 2025**: Initial implementation -- build pipeline, embedding model comparison, chunk size experiments, HyDE testing.
+- **Jan 2026**: Command-line interface, prompt improvements, model comparison (command-r7b selected).
+- **Feb 2026**: Cross-encoder re-ranking, hybrid BM25+vector retrieval, LlamaIndex upgrade to 0.14.14, OpenAI API backend, incremental updates, clippings search (ChromaDB), project reorganization.

 See `devlog.txt` for detailed development notes and experimental findings.
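
Note on the README changes above: the hybrid step it describes (take the vector and BM25 candidates, merge and deduplicate, tag each chunk as `[vector-only]`, `[bm25-only]`, or `[vector+bm25]`, then re-rank the union) can be sketched in plain Python. This is an illustration only and is not taken from `query_hybrid.py` or `retrieve.py`. The names `merge_candidates`, `rerank`, and `overlap_score` are hypothetical, chunks are assumed to be `(chunk_id, text)` pairs, and the word-overlap scorer is a stand-in for the cross-encoder.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, str]  # (chunk_id, text); an assumed shape, not the scripts' real type


def merge_candidates(vector_hits: List[Hit], bm25_hits: List[Hit]) -> List[Dict[str, str]]:
    """Union two candidate lists, deduplicating by chunk_id and tagging each origin."""
    merged: Dict[str, Dict[str, str]] = {}
    for chunk_id, text in vector_hits:
        merged[chunk_id] = {"id": chunk_id, "text": text, "source": "vector-only"}
    for chunk_id, text in bm25_hits:
        entry = merged.get(chunk_id)
        if entry is None:
            merged[chunk_id] = {"id": chunk_id, "text": text, "source": "bm25-only"}
        elif entry["source"] == "vector-only":
            # Found by both retrievers.
            entry["source"] = "vector+bm25"
    return list(merged.values())


def rerank(
    query: str,
    candidates: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
    top_n: int = 15,
) -> List[Dict[str, str]]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (shared-word count); a placeholder for a cross-encoder score."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


if __name__ == "__main__":
    vector_hits = [("a", "notes on creativity"), ("b", "Louis Menand essay")]
    bm25_hits = [("b", "Louis Menand essay"), ("c", "Menand on pragmatism")]
    for chunk in rerank("Louis Menand", merge_candidates(vector_hits, bm25_hits), overlap_score):
        print(f"[{chunk['source']}] {chunk['id']}: {chunk['text']}")
```

Tagging the origin during the merge is what lets the verbatim output show whether a chunk was nominated by BM25, by vector similarity, or by both, before the re-ranker decides final order.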
--- a/archived/query_topk_prompt_engine_v3.py
+++ b/archived/query_topk_prompt_engine_v3.py
--- a/archived/retrieve_raw.py
+++ b/archived/retrieve_raw.py
--- a/build_exp_claude.py
+++ b/build_exp_claude.py
--- a/clippings_search/build_clippings.py
+++ b/clippings_search/build_clippings.py
--- a/clippings_search/retrieve_clippings.py
+++ b/clippings_search/retrieve_clippings.py
--- a/query_hybrid_bm25_v4.py
+++ b/query_hybrid_bm25_v4.py
--- a/retrieve_hybrid_raw.py
+++ b/retrieve_hybrid_raw.py
--- a/run_query.sh
+++ b/run_query.sh
@@ -6,7 +6,7 @@

 # Usage: ./run_query.sh 

-QUERY_SCRIPT="query_hybrid_bm25_v4.py"
+QUERY_SCRIPT="query_hybrid.py"

 echo -e "Current query engine is $QUERY_SCRIPT\n"