# ssearch

Semantic search over a personal journal archive and a collection of clippings. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025, plus a library of PDFs, articles, and web saves.

## How it works

```
Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
```

1. **Build**: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
2. **Retrieve**: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
3. **Re-rank**: A cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) scores each (query, chunk) pair jointly and keeps the top 15.
4. **Synthesize**: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
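
The retrieve-and-re-rank core of steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the project's actual code: `score_fn` stands in for the real cross-encoder (in the pipeline it would be `CrossEncoder(...).predict` from `sentence-transformers`).

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 30):
    """Step 2 sketch: cosine similarity between the query and every stored chunk."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                    # cosine similarity per chunk
    order = np.argsort(-sims)[:k]   # indices of the k most similar chunks
    return order, sims[order]

def rerank(query: str, chunks: list[str], score_fn, top_n: int = 15):
    """Step 3 sketch: score each (query, chunk) pair jointly, keep the top-n."""
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]
```

The bi-encoder step is cheap (one dot product per chunk); the cross-encoder step is expensive but more accurate, which is why it only sees the 30 survivors.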
## Project structure

```
ssearch/
├── build_store.py            # Build/update journal vector store (incremental)
├── query_hybrid.py           # Hybrid BM25+vector query with LLM synthesis
├── retrieve.py               # Verbatim hybrid retrieval (no LLM)
├── search_keywords.py        # Keyword search via POS-based term extraction
├── run_query.sh              # Interactive shell wrapper with timing and logging
├── clippings_search/
│   ├── build_clippings.py    # Build/update clippings vector store (ChromaDB)
│   ├── retrieve_clippings.py # Verbatim clippings chunk retrieval
│   └── store_clippings/      # Persisted clippings vector store (ChromaDB)
├── data/                     # Symlink to journal .txt files
├── clippings/                # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── store/                    # Persisted journal vector store
├── models/                   # Cached HuggingFace models (offline)
└── requirements.txt          # Python dependencies
```

## Setup

**Prerequisites**: Python 3.12, [Ollama](https://ollama.com) with `command-r7b` pulled.

```bash
cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The `data/` symlink should point to the journal archive (plain `.txt` files). The `clippings/` symlink should point to the clippings folder. The embedding model (`BAAI/bge-large-en-v1.5`) and cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) are cached in `./models/` for offline use.

### Offline model loading

All query scripts set three environment variables. The first disables tokenizer parallelism (avoiding fork warnings); the other two keep HuggingFace loading models from the local cache with no network requests:

```python
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
```

**These must appear before any imports that touch HuggingFace libraries.** The `huggingface_hub` library evaluates `HF_HUB_OFFLINE` once at import time (in `huggingface_hub/constants.py`). If the env var is set after imports, the library will still attempt network access and fail offline.

Alternatively, set the variable in your shell before running Python:

```bash
export HF_HUB_OFFLINE=1
python query_hybrid.py "your query"
```

## Usage

### Build the vector stores

```bash
# Journal index -- incremental update (default)
python build_store.py

# Journal index -- full rebuild
python build_store.py --rebuild

# Clippings index -- incremental update (default)
python clippings_search/build_clippings.py

# Clippings index -- full rebuild
python clippings_search/build_clippings.py --rebuild
```

The default incremental mode loads the existing index, compares file sizes and modification dates, and re-indexes only what changed. A full rebuild is needed only when the chunk parameters or the embedding model change.
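
The change detection described above can be sketched like this. This is a hypothetical helper, not the scripts' actual bookkeeping; `manifest` stands in for whatever per-file metadata the index persists.

```python
import os

def changed_files(paths, manifest):
    """Return the paths whose size or mtime differs from the stored manifest.

    `manifest` maps path -> (size, mtime) as recorded at the last build.
    A file is re-indexed if it is new or if either value has changed.
    """
    stale = []
    for path in paths:
        st = os.stat(path)
        if manifest.get(path) != (st.st_size, st.st_mtime):
            stale.append(path)
    return stale
```

Size plus mtime is a cheap heuristic: it avoids hashing file contents, at the cost of missing edits that preserve both (rare in practice for journal files).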

`build_clippings.py` handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing: those without extractable text are skipped and written to `ocr_needed.txt` for later OCR.

### Search journals

#### Semantic search with LLM synthesis

**Requires Ollama running with `command-r7b`.**

**Hybrid BM25 + vector** (`query_hybrid.py`): Retrieves the top 20 chunks by vector similarity and the top 20 by BM25 term frequency, merges and deduplicates them, re-ranks the union to the top 15, then synthesizes. This catches exact name/term matches that vector-only retrieval misses.

```bash
python query_hybrid.py "What does the author say about creativity?"
```
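
The merge-and-deduplicate step can be sketched as a union keyed on chunk IDs. This is illustrative only; it assumes each hit carries a stable ID, and the names are not the script's.

```python
def merge_candidates(vector_hits, bm25_hits):
    """Union two ranked candidate lists, dropping duplicate chunk IDs.

    Each hit is an (id, text) pair. Order within each list is preserved,
    vector hits first; final relevance is decided later by the cross-encoder,
    so the merge order here does not matter much.
    """
    seen, merged = set(), []
    for hit_id, text in list(vector_hits) + list(bm25_hits):
        if hit_id not in seen:
            seen.add(hit_id)
            merged.append((hit_id, text))
    return merged
```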

**Interactive wrapper** (`run_query.sh`): Loops for queries, displays timing, and appends queries to `query.log`.

```bash
./run_query.sh
```

#### Verbatim chunk retrieval (no LLM)

Same hybrid retrieval and re-ranking pipeline, but outputs raw chunk text. Each chunk is annotated with its source: `[vector-only]`, `[bm25-only]`, or `[vector+bm25]`. **No Ollama needed.**

```bash
python retrieve.py "Kondiaronk and the Wendats"
```
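
The source annotations follow directly from set membership across the two candidate lists. A sketch, with names of my own choosing rather than the script's:

```python
def annotate_sources(chunk_ids, vector_ids, bm25_ids):
    """Label each retrieved chunk with which retriever(s) nominated it."""
    vec, bm = set(vector_ids), set(bm25_ids)
    labels = {}
    for cid in chunk_ids:
        if cid in vec and cid in bm:
            labels[cid] = "[vector+bm25]"
        elif cid in vec:
            labels[cid] = "[vector-only]"
        else:
            labels[cid] = "[bm25-only]"
    return labels
```

The `[vector+bm25]` tag is a useful relevance signal in itself: a chunk both retrievers nominated is usually a strong match.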

#### Keyword search (no vector store, no LLM)

Extracts nouns and adjectives from the query using NLTK POS tagging, then greps journal files for matches with surrounding context.

```bash
python search_keywords.py "Discussions of Kondiaronk and the Wendats"
```

### Search clippings

Verbatim chunk retrieval from the clippings index, using the same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then the full chunk text, with page numbers for PDF sources. **No Ollama needed.**

```bash
python clippings_search/retrieve_clippings.py "creativity and innovation"
```

### Output format

```
Response:
<LLM-synthesized answer citing specific files>

Source documents:
2024-03-15.txt  ./data/2024-03-15.txt  0.683
2023-11-02.txt  ./data/2023-11-02.txt  0.651
...
```

## Configuration

Key parameters (set in source files):

| Parameter | Value | Location |
|-----------|-------|----------|
| Embedding model | `BAAI/bge-large-en-v1.5` | all build and query scripts |
| Chunk size | 256 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Chunk overlap | 25 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Paragraph separator | `\n\n` | `build_store.py` |
| Initial retrieval | 30 chunks | query and retrieve scripts |
| Re-rank model | `cross-encoder/ms-marco-MiniLM-L-12-v2` | query and retrieve scripts |
| Re-rank top-n | 15 | query and retrieve scripts |
| LLM | `command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API) | `query_hybrid.py` |
| Temperature | 0.3 | `query_hybrid.py` |
| Context window | 8000 tokens | `query_hybrid.py` |
| Request timeout | 360 seconds | `query_hybrid.py` |

## Key dependencies

- **llama-index-core** (0.14.14) -- RAG framework
- **llama-index-embeddings-huggingface** -- embedding integration
- **llama-index-vector-stores-chroma** -- ChromaDB vector store for clippings
- **llama-index-llms-ollama** -- local LLM via Ollama
- **llama-index-llms-openai** -- OpenAI API LLM (optional)
- **llama-index-retrievers-bm25** -- BM25 sparse retrieval for hybrid search
- **chromadb** -- persistent vector store for clippings index
- **sentence-transformers** -- cross-encoder re-ranking
- **torch** -- ML runtime

## Design decisions

- **BAAI/bge-large-en-v1.5 over all-mpnet-base-v2**: Better semantic matching quality for journal text despite slower embedding.
- **256-token chunks**: Tested 512 and 384; 256 with 25-token overlap produced the highest-quality matches.
- **command-r7b over llama3.1:8B**: Sticks closer to the provided context with less hallucination at comparable speed.
- **Cross-encoder re-ranking**: Retrieve the top 30 via the bi-encoder, then re-rank to the top 15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; `ms-marco-MiniLM-L-12-v2` was selected over `stsb-roberta-base` (wrong task) and `BAAI/bge-reranker-v2-m3` (slower, weak score tail).
- **HyDE query rewriting tested and dropped**: Did not improve results over direct prompt engineering.
- **Hybrid BM25 + vector retrieval**: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
- **ChromaDB for clippings**: Persistent SQLite-backed store, chosen over the JSON store for its metadata filtering and direct chunk-level operations for incremental updates.
- **PDF validation before indexing**: Each PDF is pre-checked with pypdf; files whose extracted text is under 100 characters or has a low printable ratio are skipped and written to `ocr_needed.txt`.
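
The validation heuristic can be sketched as a pure text check. The 100-character floor comes from the note above; the 0.8 printable-ratio threshold is my illustrative assumption, and the text is assumed to come from pypdf (e.g. joining `page.extract_text()` over all pages), which is not shown here.

```python
def looks_extractable(text: str, min_chars: int = 100,
                      min_printable: float = 0.8) -> bool:
    """Heuristic PDF text check: enough characters, mostly printable.

    `text` is assumed to be pypdf's extraction output for the whole file.
    A scanned (image-only) PDF typically yields little or garbage text
    and fails one of these two checks. The 0.8 ratio is an assumed
    threshold, not the project's actual value.
    """
    if len(text) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) >= min_printable
```

Files failing the check would then be appended to `ocr_needed.txt` instead of being indexed.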