Test clean deploy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Commit 42e5e20e17 by Eric Furst, 2026-02-27 05:59:01 -05:00
11 changed files with 1790 additions and 0 deletions

.gitignore (vendored, new file, 36 lines)

# Python
.venv/
__pycache__/
*.pyc
# HuggingFace cached models (large, ~2 GB)
models/
# Vector stores (large, rebuild with build scripts)
store/
clippings_search/store_clippings/
# Data (symlinks to private files)
data
clippings
# Generated file lists
ocr_needed.txt
# IDE and OS
.DS_Store
.vscode/
.idea/
# Jupyter checkpoints
.ipynb_checkpoints/
# Secrets
.env
API_key_temp
# Query log
query.log
# Duplicate of CLAUDE.md
claude.md

LICENSE (new file, 21 lines)

MIT License

Copyright (c) 2026 E. M. Furst

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md (new file, 180 lines)

# ssearch
Semantic search over a personal journal archive and a collection of clippings. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025, plus a library of PDFs, articles, and web saves.
## How it works
```
Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
```
1. **Build**: Source files are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. The journal index uses LlamaIndex's JSON store; the clippings index uses ChromaDB. Both support incremental updates.
2. **Retrieve**: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
3. **Re-rank**: A cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) scores each (query, chunk) pair jointly and keeps the top 15.
4. **Synthesize**: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.
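The retrieve-then-re-rank shape of steps 2 and 3 can be sketched with toy data. This is a minimal stand-in, not the project's code: the two-element lists stand in for real 1024-dimensional bge embeddings, and the `joint_score` dict stands in for what a cross-encoder would compute per (query, chunk) pair.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend chunk embeddings (2-D stand-ins for real embedding vectors)
chunks = {
    "2024-03-15.txt": [0.9, 0.1],
    "2023-11-02.txt": [0.7, 0.3],
    "2020-06-01.txt": [0.1, 0.9],
}
query_vec = [1.0, 0.0]

# Stage 1: bi-encoder retrieval -- rank all chunks by cosine, keep top-k
candidates = sorted(
    chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True
)[:2]

# Stage 2: cross-encoder re-rank -- stand-in joint scores per (query, chunk)
joint_score = {"2024-03-15.txt": 0.68, "2023-11-02.txt": 0.65, "2020-06-01.txt": 0.12}
reranked = sorted(candidates, key=lambda c: joint_score[c], reverse=True)
print(reranked)  # → ['2024-03-15.txt', '2023-11-02.txt']
```

The key point is that stage 1 scores query and chunk independently (cheap, runs over the whole store), while stage 2 scores each pair jointly (expensive, runs only over the candidates).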
## Project structure
```
ssearch/
├── build_store.py            # Build/update journal vector store (incremental)
├── query_hybrid.py           # Hybrid BM25+vector query with LLM synthesis
├── retrieve.py               # Verbatim hybrid retrieval (no LLM)
├── search_keywords.py        # Keyword search via POS-based term extraction
├── run_query.sh              # Interactive shell wrapper with timing and logging
├── clippings_search/
│   ├── build_clippings.py       # Build/update clippings vector store (ChromaDB)
│   ├── retrieve_clippings.py    # Verbatim clippings chunk retrieval
│   └── store_clippings/         # Persisted clippings vector store (ChromaDB)
├── data/                     # Symlink to journal .txt files
├── clippings/                # Symlink to clippings (PDFs, TXT, webarchive, RTF)
├── store/                    # Persisted journal vector store
├── models/                   # Cached HuggingFace models (offline)
└── requirements.txt          # Python dependencies
```
## Setup
**Prerequisites**: Python 3.12, [Ollama](https://ollama.com) with `command-r7b` pulled.
```bash
cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
The `data/` symlink should point to the journal archive (plain `.txt` files). The `clippings/` symlink should point to the clippings folder. The embedding model (`BAAI/bge-large-en-v1.5`) and cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) are cached in `./models/` for offline use.
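A possible symlink setup is sketched below; the source paths are placeholders, since the real journal and clippings locations are private and not given in this repo.

```bash
# Example only -- point the links at wherever the private folders actually live.
ln -s "$HOME/Documents/journal" ./data
ln -s "$HOME/Documents/clippings" ./clippings
ls -l data clippings
```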
### Offline model loading
All query scripts set three environment variables to prevent HuggingFace from making network requests:
```python
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
```
**These must appear before any imports that touch HuggingFace libraries.** The `huggingface_hub` library evaluates `HF_HUB_OFFLINE` once at import time (in `huggingface_hub/constants.py`). If the env var is set after imports, the library will still attempt network access and fail offline.
Alternatively, set the variable in your shell before running Python:
```bash
export HF_HUB_OFFLINE=1
python query_hybrid.py "your query"
```
## Usage
### Build the vector stores
```bash
# Journal index -- incremental update (default)
python build_store.py
# Journal index -- full rebuild
python build_store.py --rebuild
# Clippings index -- incremental update (default)
python clippings_search/build_clippings.py
# Clippings index -- full rebuild
python clippings_search/build_clippings.py --rebuild
```
The default incremental mode loads the existing index, compares file sizes and modification dates, and only re-indexes what changed. A full rebuild is only needed when chunk parameters or the embedding model change.
`build_clippings.py` handles PDFs, TXT, webarchive, and RTF files. PDFs are validated before indexing — those without extractable text are skipped and written to `ocr_needed.txt` for later OCR.
### Search journals
#### Semantic search with LLM synthesis
**Requires Ollama running with `command-r7b`.**
**Hybrid BM25 + vector** (`query_hybrid.py`): retrieves the top 20 chunks by vector similarity and the top 20 by BM25 term frequency, merges and deduplicates the candidates, re-ranks the union to the top 15, and synthesizes an answer. Catches exact name/term matches that vector-only retrieval misses.
```bash
python query_hybrid.py "What does the author say about creativity?"
```
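The merge-and-deduplicate step described above can be sketched as follows. The node IDs and origin tags are illustrative; the tag strings match the annotations `retrieve.py` prints.

```python
# Two ranked candidate lists of node IDs, one per retriever
vector_hits = ["n3", "n7", "n1"]
bm25_hits = ["n7", "n9", "n2"]

# Union, preserving first-seen rank order and dropping duplicates
seen = set()
union = []
for nid in vector_hits + bm25_hits:
    if nid not in seen:
        seen.add(nid)
        union.append(nid)

# Tag each candidate with its origin before the cross-encoder re-ranks
origin = {
    nid: ("[vector+bm25]" if nid in vector_hits and nid in bm25_hits
          else "[vector-only]" if nid in vector_hits
          else "[bm25-only]")
    for nid in union
}
print(union)          # → ['n3', 'n7', 'n1', 'n9', 'n2']
print(origin["n7"])   # → [vector+bm25]
```

The cross-encoder then scores every candidate in the union identically, regardless of which retriever nominated it.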
**Interactive wrapper** (`run_query.sh`): Loops for queries, displays timing, and appends queries to `query.log`.
```bash
./run_query.sh
```
#### Verbatim chunk retrieval (no LLM)
Same hybrid retrieval and re-ranking pipeline but outputs raw chunk text. Each chunk is annotated with its source: `[vector-only]`, `[bm25-only]`, or `[vector+bm25]`. **No Ollama needed.**
```bash
python retrieve.py "Kondiaronk and the Wendats"
```
#### Keyword search (no vector store, no LLM)
Extracts nouns and adjectives from the query using NLTK POS tagging, then greps journal files for matches with surrounding context.
```bash
python search_keywords.py "Discussions of Kondiaronk and the Wendats"
```
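The real script uses NLTK POS tagging to keep nouns and adjectives; the sketch below swaps that for a crude stopword filter so it runs without NLTK's tagger data. It shows the term-extract-then-grep-with-context idea, not the actual implementation, and the stopword list is invented.

```python
# Crude stand-in for POS-based term extraction (real script uses NLTK)
STOPWORDS = {"discussions", "of", "and", "the", "a", "an", "in", "on"}

def extract_terms(query):
    # Keep tokens that are not stopwords -- a rough proxy for nouns/adjectives
    return [w for w in query.lower().split() if w not in STOPWORDS]

def grep_with_context(text, terms, width=20):
    # Return a slice of surrounding context for each term match
    hits = []
    low = text.lower()
    for t in terms:
        i = low.find(t)
        if i >= 0:
            hits.append(text[max(0, i - width): i + len(t) + width])
    return hits

terms = extract_terms("Discussions of Kondiaronk and the Wendats")
print(terms)  # → ['kondiaronk', 'wendats']
```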
### Search clippings
Verbatim chunk retrieval from the clippings index. Same embedding model and cross-encoder re-ranking. Outputs a summary of source files and rankings, then full chunk text. Includes page numbers for PDF sources. **No Ollama needed.**
```bash
python clippings_search/retrieve_clippings.py "creativity and innovation"
```
### Output format
```
Response:
<LLM-synthesized answer citing specific files>
Source documents:
2024-03-15.txt ./data/2024-03-15.txt 0.683
2023-11-02.txt ./data/2023-11-02.txt 0.651
...
```
## Configuration
Key parameters (set in source files):
| Parameter | Value | Location |
|-----------|-------|----------|
| Embedding model | `BAAI/bge-large-en-v1.5` | all build and query scripts |
| Chunk size | 256 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Chunk overlap | 25 tokens | `build_store.py`, `clippings_search/build_clippings.py` |
| Paragraph separator | `\n\n` | `build_store.py` |
| Initial retrieval | 30 chunks | query and retrieve scripts |
| Re-rank model | `cross-encoder/ms-marco-MiniLM-L-12-v2` | query and retrieve scripts |
| Re-rank top-n | 15 | query and retrieve scripts |
| LLM | `command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API) | `query_hybrid.py` |
| Temperature | 0.3 | `query_hybrid.py` |
| Context window | 8000 tokens | `query_hybrid.py` |
| Request timeout | 360 seconds | `query_hybrid.py` |
## Key dependencies
- **llama-index-core** (0.14.14) -- RAG framework
- **llama-index-embeddings-huggingface** -- embedding integration
- **llama-index-vector-stores-chroma** -- ChromaDB vector store for clippings
- **llama-index-llms-ollama** -- local LLM via Ollama
- **llama-index-llms-openai** -- OpenAI API LLM (optional)
- **llama-index-retrievers-bm25** -- BM25 sparse retrieval for hybrid search
- **chromadb** -- persistent vector store for clippings index
- **sentence-transformers** -- cross-encoder re-ranking
- **torch** -- ML runtime
## Design decisions
- **BAAI/bge-large-en-v1.5 over all-mpnet-base-v2**: Better semantic matching quality for journal text despite slower embedding.
- **256-token chunks**: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
- **command-r7b over llama3.1:8B**: Sticks closer to provided context with less hallucination at comparable speed.
- **Cross-encoder re-ranking**: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. Tested three models; `ms-marco-MiniLM-L-12-v2` selected over `stsb-roberta-base` (wrong task) and `BAAI/bge-reranker-v2-m3` (slower, weak score tail).
- **HyDE query rewriting tested and dropped**: Did not improve results over direct prompt engineering.
- **Hybrid BM25 + vector retrieval**: BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance.
- **ChromaDB for clippings**: Persistent SQLite-backed store. Chosen over the JSON store for its metadata filtering and direct chunk-level operations for incremental updates.
- **PDF validation before indexing**: Pre-check each PDF with pypdf — skip if text extraction yields <100 chars or low printable ratio. Skipped files written to `ocr_needed.txt`.

build_store.py (new file, 193 lines)

# build_store.py
#
# Build or update the vector store from journal entries in ./data.
#
# Default mode (incremental): loads the existing index and adds only
# new or modified files. Use --rebuild for a full rebuild from scratch.
#
# January 2026
# E. M. Furst
# Used Sonnet 4.5 to suggest changes; Opus 4.6 for incremental update
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from pathlib import Path
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
import argparse
import datetime
import os
import time

# Shared constants
DATA_DIR = Path("./data")
PERSIST_DIR = "./store"
EMBED_MODEL_NAME = "BAAI/bge-large-en-v1.5"
CHUNK_SIZE = 256
CHUNK_OVERLAP = 25


def get_text_splitter():
    return SentenceSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        paragraph_separator="\n\n",
    )


def rebuild():
    """Full rebuild: delete and recreate the vector store from scratch."""
    if not DATA_DIR.exists():
        raise FileNotFoundError(f"Data directory not found: {DATA_DIR.absolute()}")
    print(f"Loading documents from {DATA_DIR.absolute()}...")
    documents = SimpleDirectoryReader(str(DATA_DIR)).load_data()
    if not documents:
        raise ValueError("No documents found in data directory")
    print(f"Loaded {len(documents)} document(s)")
    print("Building vector index...")
    index = VectorStoreIndex.from_documents(
        documents,
        transformations=[get_text_splitter()],
        show_progress=True,
    )
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"Index built and saved to {PERSIST_DIR}")


def update():
    """Incremental update: add new files, re-index modified files, remove deleted files."""
    if not DATA_DIR.exists():
        raise FileNotFoundError(f"Data directory not found: {DATA_DIR.absolute()}")
    # Load existing index
    print(f"Loading existing index from {PERSIST_DIR}...")
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
    # Set transformations so index.insert() chunks correctly
    Settings.transformations = [get_text_splitter()]
    # Build lookup of indexed files: file_name -> (ref_doc_id, metadata)
    all_ref_docs = index.docstore.get_all_ref_doc_info()
    indexed = {}
    for ref_id, info in all_ref_docs.items():
        fname = info.metadata.get("file_name")
        if fname:
            indexed[fname] = (ref_id, info.metadata)
    print(f"Index contains {len(indexed)} documents")
    # Scan current files on disk
    disk_files = {f.name: f for f in sorted(DATA_DIR.glob("*.txt"))}
    print(f"Data directory contains {len(disk_files)} files")
    # Classify files
    new_files = []
    modified_files = []
    deleted_files = []
    unchanged = 0
    for fname, fpath in disk_files.items():
        if fname not in indexed:
            new_files.append(fpath)
        else:
            ref_id, meta = indexed[fname]
            # Compare file size and modification date
            stat = fpath.stat()
            disk_size = stat.st_size
            # Must use UTC to match SimpleDirectoryReader's date format
            disk_mdate = datetime.datetime.fromtimestamp(
                stat.st_mtime, tz=datetime.timezone.utc
            ).strftime("%Y-%m-%d")
            stored_size = meta.get("file_size")
            stored_mdate = meta.get("last_modified_date")
            if disk_size != stored_size or disk_mdate != stored_mdate:
                modified_files.append((fpath, ref_id))
            else:
                unchanged += 1
    for fname, (ref_id, meta) in indexed.items():
        if fname not in disk_files:
            deleted_files.append((fname, ref_id))
    # Report
    print(f"\n  New:       {len(new_files)}")
    print(f"  Modified:  {len(modified_files)}")
    print(f"  Deleted:   {len(deleted_files)}")
    print(f"  Unchanged: {unchanged}")
    if not new_files and not modified_files and not deleted_files:
        print("\nNothing to do.")
        return
    # Process deletions (including modified files that need re-indexing)
    for fname, ref_id in deleted_files:
        print(f"  Removing {fname}")
        index.delete_ref_doc(ref_id, delete_from_docstore=True)
    for fpath, ref_id in modified_files:
        print(f"  Re-indexing {fpath.name} (modified)")
        index.delete_ref_doc(ref_id, delete_from_docstore=True)
    # Process additions (new files + modified files)
    files_to_add = new_files + [fpath for fpath, _ in modified_files]
    if files_to_add:
        print(f"\nIndexing {len(files_to_add)} file(s)...")
        # Use "./" prefix to match paths from full build (pathlib strips it)
        docs = SimpleDirectoryReader(
            input_files=[f"./{f}" for f in files_to_add]
        ).load_data()
        for doc in docs:
            index.insert(doc)
    # Persist
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"\nIndex updated and saved to {PERSIST_DIR}")


def main():
    parser = argparse.ArgumentParser(
        description="Build or update the vector store from journal entries."
    )
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Full rebuild from scratch (default: incremental update)",
    )
    args = parser.parse_args()
    # Configure embedding model
    embed_model = HuggingFaceEmbedding(model_name=EMBED_MODEL_NAME)
    Settings.embed_model = embed_model
    start = time.time()
    if args.rebuild:
        print("Mode: full rebuild")
        rebuild()
    else:
        print("Mode: incremental update")
        if not Path(PERSIST_DIR).exists():
            print(f"No existing index at {PERSIST_DIR}, doing full rebuild.")
            rebuild()
        else:
            update()
    elapsed = time.time() - start
    print(f"Done in {elapsed:.1f}s")


if __name__ == "__main__":
    main()

clippings_search/build_clippings.py (new file, 471 lines)

# build_clippings.py
#
# Build or update the ChromaDB vector store from clippings in ./clippings.
#
# Default mode (incremental): loads the existing index and adds only
# new or modified files. Use --rebuild for a full rebuild from scratch.
#
# Handles PDFs, TXT, webarchive, and RTF files. Skips non-extractable PDFs
# and writes them to ocr_needed.txt for later OCR processing.
#
# February 2026
# E. M. Furst
# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
import chromadb
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    Settings,
    Document,
)
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from pathlib import Path
import argparse
import datetime
import time

# Shared constants
DATA_DIR = Path("./clippings")
PERSIST_DIR = "./clippings_search/store_clippings"
COLLECTION_NAME = "clippings"
EMBED_MODEL_NAME = "BAAI/bge-large-en-v1.5"
CHUNK_SIZE = 256
CHUNK_OVERLAP = 25
# File types handled by SimpleDirectoryReader (PDF + TXT)
READER_EXTS = {".pdf", ".txt"}
# File types handled by custom loaders
CUSTOM_EXTS = {".webarchive", ".rtf"}
# All supported extensions
SUPPORTED_EXTS = READER_EXTS | CUSTOM_EXTS
# Minimum extracted text length to consider a PDF valid (characters)
MIN_TEXT_LENGTH = 100


def get_text_splitter():
    return SentenceSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        paragraph_separator="\n\n",
    )


def validate_pdf(file_path):
    """Check if a PDF has extractable text.

    Returns (is_valid, reason) where reason explains why it was skipped.
    """
    import pypdf

    try:
        reader = pypdf.PdfReader(str(file_path))
        page_count = len(reader.pages)
        total_chars = 0
        printable_chars = 0
        for page in reader.pages:
            text = page.extract_text() or ""
            total_chars += len(text)
            printable_chars += sum(
                1 for c in text if c.isprintable() or c in "\n\r\t"
            )
        if total_chars < MIN_TEXT_LENGTH:
            return False, f"too little text ({total_chars} chars, {page_count} pages)"
        ratio = printable_chars / total_chars if total_chars > 0 else 0
        if ratio < 0.5:
            return False, f"low printable ratio ({ratio:.2f}, {page_count} pages)"
        return True, None
    except Exception as e:
        return False, str(e)


def load_webarchive(file_path):
    """Extract text from a macOS .webarchive file.

    Returns a LlamaIndex Document, or None if extraction fails.
    """
    import plistlib
    from bs4 import BeautifulSoup

    try:
        with open(file_path, "rb") as f:
            plist = plistlib.load(f)
        resource = plist.get("WebMainResource", {})
        html_bytes = resource.get("WebResourceData", b"")
        if not html_bytes:
            return None
        html = html_bytes.decode("utf-8", errors="replace")
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(separator="\n", strip=True)
        if len(text) < MIN_TEXT_LENGTH:
            return None
        stat = file_path.stat()
        mdate = datetime.datetime.fromtimestamp(
            stat.st_mtime, tz=datetime.timezone.utc
        ).strftime("%Y-%m-%d")
        return Document(
            text=text,
            metadata={
                "file_name": file_path.name,
                "file_path": str(file_path),
                "file_size": stat.st_size,
                "last_modified_date": mdate,
                "file_type": "webarchive",
            },
        )
    except Exception as e:
        print(f"  Warning: could not read webarchive {file_path.name}: {e}")
        return None


def load_rtf(file_path):
    """Extract text from an RTF file.

    Returns a LlamaIndex Document, or None if extraction fails.
    """
    from striprtf.striprtf import rtf_to_text

    try:
        with open(file_path, "r", errors="replace") as f:
            rtf_content = f.read()
        text = rtf_to_text(rtf_content)
        if len(text) < MIN_TEXT_LENGTH:
            return None
        stat = file_path.stat()
        mdate = datetime.datetime.fromtimestamp(
            stat.st_mtime, tz=datetime.timezone.utc
        ).strftime("%Y-%m-%d")
        return Document(
            text=text,
            metadata={
                "file_name": file_path.name,
                "file_path": str(file_path),
                "file_size": stat.st_size,
                "last_modified_date": mdate,
                "file_type": "rtf",
            },
        )
    except Exception as e:
        print(f"  Warning: could not read RTF {file_path.name}: {e}")
        return None


def scan_clippings():
    """Scan the clippings directory and classify files.

    Returns (reader_files, custom_docs, skipped, ocr_needed) where:
      - reader_files: list of Paths for SimpleDirectoryReader (PDF + TXT)
      - custom_docs: list of Document objects from custom loaders
      - skipped: list of (Path, reason) tuples
      - ocr_needed: list of Paths for PDFs that need OCR
    """
    reader_files = []
    custom_docs = []
    skipped = []
    ocr_needed = []
    for fpath in sorted(DATA_DIR.rglob("*")):
        if not fpath.is_file():
            continue
        if fpath.name.startswith("."):
            continue
        ext = fpath.suffix.lower()
        if ext not in SUPPORTED_EXTS:
            skipped.append((fpath, f"unsupported type: {ext}"))
            continue
        if ext == ".pdf":
            is_valid, reason = validate_pdf(fpath)
            if not is_valid:
                skipped.append((fpath, f"no extractable text: {reason}"))
                ocr_needed.append(fpath)
                continue
            reader_files.append(fpath)
        elif ext == ".txt":
            reader_files.append(fpath)
        elif ext == ".webarchive":
            doc = load_webarchive(fpath)
            if doc:
                custom_docs.append(doc)
            else:
                skipped.append((fpath, "no extractable text from webarchive"))
        elif ext == ".rtf":
            doc = load_rtf(fpath)
            if doc:
                custom_docs.append(doc)
            else:
                skipped.append((fpath, "no extractable text from RTF"))
    return reader_files, custom_docs, skipped, ocr_needed


def write_ocr_list(ocr_needed):
    """Write the list of PDFs needing OCR to ocr_needed.txt."""
    with open("ocr_needed.txt", "w") as f:
        for fpath in ocr_needed:
            f.write(f"{fpath}\n")
    print(f"Wrote {len(ocr_needed)} file(s) to ocr_needed.txt")


def load_all_documents(reader_files, custom_docs):
    """Load documents from SimpleDirectoryReader and merge with custom docs."""
    documents = []
    if reader_files:
        print(f"Loading {len(reader_files)} PDF/TXT files...")
        reader_docs = SimpleDirectoryReader(
            input_files=[str(f) for f in reader_files],
            filename_as_id=True,
        ).load_data()
        documents.extend(reader_docs)
    if custom_docs:
        print(f"Adding {len(custom_docs)} webarchive/RTF documents...")
        documents.extend(custom_docs)
    return documents


def rebuild(reader_files, custom_docs):
    """Full rebuild: delete existing collection and recreate from scratch."""
    client = chromadb.PersistentClient(path=PERSIST_DIR)
    # Delete existing collection if present
    try:
        client.delete_collection(COLLECTION_NAME)
        print(f"Deleted existing collection '{COLLECTION_NAME}'")
    except Exception:
        pass
    collection = client.get_or_create_collection(COLLECTION_NAME)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    documents = load_all_documents(reader_files, custom_docs)
    if not documents:
        raise ValueError("No documents loaded")
    print(f"Loaded {len(documents)} document(s) total")
    print("Building vector index...")
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        transformations=[get_text_splitter()],
        show_progress=True,
    )
    print(f"Index built. Collection has {collection.count()} vectors.")
    return index


def update(reader_files, custom_docs):
    """Incremental update: add new, re-index modified, remove deleted files."""
    client = chromadb.PersistentClient(path=PERSIST_DIR)
    collection = client.get_collection(COLLECTION_NAME)
    count = collection.count()
    print(f"Existing collection has {count} vectors")
    # Get all stored metadata to find what's indexed.
    # Key on file_path (not file_name) to handle duplicate names across subdirs.
    indexed = {}  # file_path -> {"ids": [], "file_size": ..., "last_modified_date": ...}
    if count > 0:
        results = collection.get(include=["metadatas"])
        for i, meta in enumerate(results["metadatas"]):
            fpath = meta.get("file_path", "")
            if fpath not in indexed:
                indexed[fpath] = {
                    "ids": [],
                    "file_size": meta.get("file_size"),
                    "last_modified_date": meta.get("last_modified_date"),
                }
            indexed[fpath]["ids"].append(results["ids"][i])
    print(f"Index contains {len(indexed)} unique files")
    # Build disk file lookup: file_path_str -> Path.
    # For reader_files, match the path format SimpleDirectoryReader would store.
    disk_files = {}
    for f in reader_files:
        disk_files[str(f)] = f
    for doc in custom_docs:
        disk_files[doc.metadata["file_path"]] = Path(doc.metadata["file_path"])
    # Classify files
    new_reader = []
    new_custom = []
    modified_reader = []
    modified_custom = []
    deleted_paths = []
    unchanged = 0
    for path_str, fpath in disk_files.items():
        if path_str not in indexed:
            # Check if it's a custom doc
            if fpath.suffix.lower() in CUSTOM_EXTS:
                matching = [d for d in custom_docs if d.metadata["file_path"] == path_str]
                if matching:
                    new_custom.extend(matching)
            else:
                new_reader.append(fpath)
        else:
            info = indexed[path_str]
            stat = fpath.stat()
            disk_mdate = datetime.datetime.fromtimestamp(
                stat.st_mtime, tz=datetime.timezone.utc
            ).strftime("%Y-%m-%d")
            if stat.st_size != info["file_size"] or disk_mdate != info["last_modified_date"]:
                if fpath.suffix.lower() in CUSTOM_EXTS:
                    matching = [d for d in custom_docs if d.metadata["file_path"] == path_str]
                    if matching:
                        modified_custom.extend(matching)
                else:
                    modified_reader.append(fpath)
            else:
                unchanged += 1
    for path_str in indexed:
        if path_str not in disk_files:
            deleted_paths.append(path_str)
    n_new = len(new_reader) + len(new_custom)
    n_modified = len(modified_reader) + len(modified_custom)
    print(f"\n  New:       {n_new}")
    print(f"  Modified:  {n_modified}")
    print(f"  Deleted:   {len(deleted_paths)}")
    print(f"  Unchanged: {unchanged}")
    if n_new == 0 and n_modified == 0 and len(deleted_paths) == 0:
        print("\nNothing to do.")
        return
    # Delete chunks for removed and modified files
    for path_str in deleted_paths:
        ids = indexed[path_str]["ids"]
        fname = Path(path_str).name
        print(f"  Removing {fname} ({len(ids)} chunks)")
        collection.delete(ids=ids)
    for fpath in modified_reader:
        path_str = str(fpath)
        ids = indexed[path_str]["ids"]
        print(f"  Re-indexing {fpath.name} ({len(ids)} chunks)")
        collection.delete(ids=ids)
    for doc in modified_custom:
        path_str = doc.metadata["file_path"]
        if path_str in indexed:
            ids = indexed[path_str]["ids"]
            print(f"  Re-indexing {doc.metadata['file_name']} ({len(ids)} chunks)")
            collection.delete(ids=ids)
    # Add new and modified files
    files_to_add = new_reader + modified_reader
    docs_to_add = new_custom + modified_custom
    if files_to_add or docs_to_add:
        documents = load_all_documents(files_to_add, docs_to_add)
        if documents:
            print(f"Indexing {len(documents)} document(s)...")
            vector_store = ChromaVectorStore(chroma_collection=collection)
            storage_context = StorageContext.from_defaults(vector_store=vector_store)
            VectorStoreIndex.from_documents(
                documents,
                storage_context=storage_context,
                transformations=[get_text_splitter()],
                show_progress=True,
            )
    print(f"\nIndex updated. Collection now has {collection.count()} vectors.")


def main():
    parser = argparse.ArgumentParser(
        description="Build or update the clippings vector store (ChromaDB)."
    )
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Full rebuild from scratch (default: incremental update)",
    )
    args = parser.parse_args()
    # Configure embedding model (offline, cached in ./models)
    embed_model = HuggingFaceEmbedding(
        model_name=EMBED_MODEL_NAME,
        cache_folder="./models",
        local_files_only=True,
    )
    Settings.embed_model = embed_model
    if not DATA_DIR.exists():
        raise FileNotFoundError(
            f"Clippings directory not found: {DATA_DIR.absolute()}\n"
            f"Create symlink: ln -s ../clippings ./clippings"
        )
    start = time.time()
    # Scan and classify files
    print(f"Scanning {DATA_DIR.absolute()}...")
    reader_files, custom_docs, skipped, ocr_needed = scan_clippings()
    n_valid = len(reader_files) + len(custom_docs)
    print(f"\nFiles to index: {n_valid}")
    print(f"  PDF/TXT: {len(reader_files)}")
    print(f"  Webarchive/RTF: {len(custom_docs)}")
    print(f"Files skipped: {len(skipped)}")
    for fpath, reason in skipped:
        print(f"  SKIP: {fpath.name} -- {reason}")
    if ocr_needed:
        write_ocr_list(ocr_needed)
    if n_valid == 0:
        raise ValueError("No valid files found to index")
    if args.rebuild:
        print("\nMode: full rebuild")
        rebuild(reader_files, custom_docs)
    else:
        print("\nMode: incremental update")
        if not Path(PERSIST_DIR).exists():
            print(f"No existing index at {PERSIST_DIR}, doing full rebuild.")
            rebuild(reader_files, custom_docs)
        else:
            update(reader_files, custom_docs)
    elapsed = time.time() - start
    print(f"Done in {elapsed:.1f}s")


if __name__ == "__main__":
    main()

clippings_search/retrieve_clippings.py (new file, 138 lines)

# retrieve_clippings.py
# Verbatim chunk retrieval from clippings index (ChromaDB).
# Vector search + cross-encoder re-ranking, no LLM.
#
# Returns the top re-ranked chunks with their full text, file metadata, and
# scores. Includes page numbers for PDF sources when available.
#
# E.M.F. February 2026
# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
import chromadb
from llama_index.core import VectorStoreIndex, Settings
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.postprocessor import SentenceTransformerRerank
import sys
import textwrap
#
# Globals
#
PERSIST_DIR = "./clippings_search/store_clippings"
COLLECTION_NAME = "clippings"
# Embedding model (must match build_clippings.py)
EMBED_MODEL = HuggingFaceEmbedding(
    cache_folder="./models",
    model_name="BAAI/bge-large-en-v1.5",
    local_files_only=True,
)
# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15
RETRIEVE_TOP_K = 30
# Output formatting
WRAP_WIDTH = 80


def main():
    # No LLM needed -- set embed model only
    Settings.embed_model = EMBED_MODEL
    # Load ChromaDB collection
    client = chromadb.PersistentClient(path=PERSIST_DIR)
    collection = client.get_collection(COLLECTION_NAME)
    # Build index from existing vector store
    vector_store = ChromaVectorStore(chroma_collection=collection)
    index = VectorStoreIndex.from_vector_store(vector_store)
    # Build retriever (vector search only, no query engine / LLM)
    retriever = index.as_retriever(similarity_top_k=RETRIEVE_TOP_K)
    # Cross-encoder re-ranker
    reranker = SentenceTransformerRerank(
        model=RERANK_MODEL,
        top_n=RERANK_TOP_N,
    )
    # Query
    if len(sys.argv) < 2:
        print("Usage: python retrieve_clippings.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])
    # Retrieve and re-rank
    nodes = retriever.retrieve(q)
    reranked = reranker.postprocess_nodes(nodes, query_str=q)
    # Build result list with metadata
    results = []
    for i, node in enumerate(reranked, 1):
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        file_name = meta.get("file_name", "unknown")
        page_label = meta.get("page_label", "")
        results.append((i, node, file_name, page_label, score))
    # --- Summary: source files and rankings ---
    print(f"\nQuery: {q}")
    print(f"Retrieved {len(nodes)} chunks, re-ranked to top {len(reranked)}")
    print(f"({collection.count()} total vectors in collection)\n")
    # Unique source files in rank order
    seen = set()
    unique_sources = []
    for i, node, file_name, page_label, score in results:
        if file_name not in seen:
            seen.add(file_name)
            unique_sources.append(file_name)
    print(f"Source files ({len(unique_sources)} unique):")
    for j, fname in enumerate(unique_sources, 1):
        print(f"  {j}. {fname}")
    print("\nRankings:")
    for i, node, file_name, page_label, score in results:
        line = f"  [{i:2d}] {score:+7.3f}  {file_name}"
        if page_label:
            line += f" (p. {page_label})"
        print(line)
    # --- Full chunk text ---
    print(f"\n{'=' * WRAP_WIDTH}")
    print("CHUNKS")
    print("=" * WRAP_WIDTH)
    for i, node, file_name, page_label, score in results:
        header = f"=== [{i}] {file_name}"
        if page_label:
            header += f" (p. {page_label})"
        header += f" (score: {score:.3f})"
        print("\n" + "=" * WRAP_WIDTH)
        print(header)
        print("=" * WRAP_WIDTH)
        text = node.get_content()
        for line in text.splitlines():
            if line.strip():
                print(textwrap.fill(line, width=WRAP_WIDTH))
            else:
                print()
        print()


if __name__ == "__main__":
    main()

query_hybrid.py (new file, 176 lines)

# query_hybrid.py
# Hybrid retrieval: BM25 (sparse) + vector similarity (dense) + cross-encoder
#
# Combines two retrieval strategies to catch both exact term matches and
# semantic similarity:
# 1. Retrieve top-20 via vector similarity (bi-encoder, catches meaning)
# 2. Retrieve top-20 via BM25 (term frequency, catches exact names/dates)
# 3. Merge and deduplicate candidates by node ID
# 4. Re-rank the union with a cross-encoder -> top-15
# 5. Pass re-ranked chunks to LLM for synthesis
#
# The cross-encoder doesn't care where candidates came from -- it scores
# each (query, chunk) pair on its own merits. BM25's job is just to
# nominate candidates that vector similarity might miss.
#
# E.M.F. February 2026
# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
from llama_index.core import (
StorageContext,
load_index_from_storage,
Settings,
get_response_synthesizer,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.retrievers.bm25 import BM25Retriever
import sys
#
# Globals
#
# Embedding model (must match build_store.py)
EMBED_MODEL = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)
# LLM model for generation
LLM_MODEL = "command-r7b"
# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15
# Retrieval parameters
VECTOR_TOP_K = 20 # candidates from vector similarity
BM25_TOP_K = 20 # candidates from BM25 term matching
#
# Custom prompt -- same as v3
#
PROMPT = PromptTemplate(
"""You are a precise research assistant analyzing excerpts from a personal journal collection.
Every excerpt below has been selected and ranked for relevance to the query.
CONTEXT (ranked by relevance):
{context_str}
QUERY:
{query_str}
Instructions:
- Answer ONLY using information explicitly present in the CONTEXT above
- Examine ALL provided excerpts, not just the top few -- each one was selected for relevance
- Be specific: quote or closely paraphrase key passages and cite their file names
- When multiple files touch on the query, note what each one contributes
- If the context doesn't contain enough information to answer fully, say so
Your response should:
1. Directly answer the query, drawing on as many relevant excerpts as possible
2. Reference specific files and their content (e.g., "In <filename>, ...")
3. End with a list of all files that contributed to your answer, with a brief note on each
If the context is insufficient, explain what's missing."""
)
def main():
# Configure LLM and embedding model
# for local model using ollama
# Note: Ollama temperature defaults to 0.8
Settings.llm = Ollama(
model=LLM_MODEL,
temperature=0.3,
request_timeout=360.0,
context_window=8000,
)
# Use OpenAI API:
# from llama_index.llms.openai import OpenAI
# Settings.llm = OpenAI(
# model="gpt-4o-mini", # or "gpt-4o" for higher quality
# temperature=0.3,
# )
Settings.embed_model = EMBED_MODEL
# Load persisted vector store
storage_context = StorageContext.from_defaults(persist_dir="./store")
index = load_index_from_storage(storage_context)
# --- Retrievers ---
# Vector retriever (dense: cosine similarity over embeddings)
vector_retriever = index.as_retriever(similarity_top_k=VECTOR_TOP_K)
# BM25 retriever (sparse: term frequency scoring)
bm25_retriever = BM25Retriever.from_defaults(
index=index,
similarity_top_k=BM25_TOP_K,
)
# Cross-encoder re-ranker
reranker = SentenceTransformerRerank(
model=RERANK_MODEL,
top_n=RERANK_TOP_N,
)
# --- Query ---
if len(sys.argv) < 2:
print("Usage: python query_hybrid.py QUERY_TEXT")
sys.exit(1)
q = " ".join(sys.argv[1:])
# Retrieve from both sources
vector_nodes = vector_retriever.retrieve(q)
bm25_nodes = bm25_retriever.retrieve(q)
# Merge and deduplicate by node ID
seen_ids = set()
merged = []
for node in vector_nodes + bm25_nodes:
node_id = node.node.node_id
if node_id not in seen_ids:
seen_ids.add(node_id)
merged.append(node)
# Re-rank the merged candidates with cross-encoder
reranked = reranker.postprocess_nodes(merged, query_str=q)
# Report retrieval stats
vector_ids = {n.node.node_id for n in vector_nodes}
bm25_ids = {n.node.node_id for n in bm25_nodes}
n_both = len(vector_ids & bm25_ids)
print(f"\nQuery: {q}")
print(f"Vector: {len(vector_nodes)}, BM25: {len(bm25_nodes)}, "
f"overlap: {n_both}, merged: {len(merged)}, re-ranked to: {len(reranked)}")
print(f" vector-only: {len(vector_ids - bm25_ids)}, bm25-only: {len(bm25_ids - vector_ids)}")
# Synthesize response with LLM
synthesizer = get_response_synthesizer(text_qa_template=PROMPT)
response = synthesizer.synthesize(q, nodes=reranked)
# Output
print("\nResponse:\n")
print(response.response)
print("\nSource documents:")
for node in response.source_nodes:
meta = getattr(node, "metadata", None) or node.node.metadata
score = getattr(node, "score", None)
print(f"{meta.get('file_name')} {meta.get('file_path')} {score:.3f}")
if __name__ == "__main__":
main()
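The merge-and-dedupe step can be exercised in isolation. A small sketch with stand-in objects (the NodeWithScore shape is simplified to just the fields used here):

```python
from collections import namedtuple

# Minimal stand-ins for llama_index NodeWithScore objects -- only
# node_id matters for deduplication.
Node = namedtuple("Node", ["node_id"])
Hit = namedtuple("Hit", ["node", "score"])

def merge_dedupe(vector_hits, bm25_hits):
    # First occurrence wins, so on overlap the vector copy is kept;
    # the cross-encoder re-scores everything afterward anyway.
    seen, merged = set(), []
    for hit in vector_hits + bm25_hits:
        if hit.node.node_id not in seen:
            seen.add(hit.node.node_id)
            merged.append(hit)
    return merged

v = [Hit(Node("a"), 0.9), Hit(Node("b"), 0.8)]
b = [Hit(Node("b"), 7.1), Hit(Node("c"), 6.5)]
merged = merge_dedupe(v, b)
```

Note the raw scores are not comparable across retrievers (cosine similarity vs. BM25), which is why the pipeline defers all ranking to the cross-encoder rather than sorting the merged list.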

requirements.txt Normal file

@ -0,0 +1,216 @@
aiohappyeyeballs==2.6.1
aiohttp==3.12.15
aiosignal==1.4.0
aiosqlite==0.21.0
annotated-doc==0.0.4
annotated-types==0.7.0
anyio==4.10.0
appnope==0.1.4
argon2-cffi==25.1.0
argon2-cffi-bindings==25.1.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.5
attrs==25.3.0
babel==2.17.0
backoff==2.2.1
banks==2.2.0
bcrypt==5.0.0
beautifulsoup4==4.13.4
bleach==6.2.0
bm25s==0.2.14
build==1.4.0
certifi==2025.8.3
cffi==1.17.1
charset-normalizer==3.4.3
chromadb==1.5.1
click==8.2.1
colorama==0.4.6
comm==0.2.3
contourpy==1.3.3
cycler==0.12.1
dataclasses-json==0.6.7
debugpy==1.8.16
decorator==5.2.1
defusedxml==0.7.1
Deprecated==1.2.18
dirtyjson==1.0.8
distro==1.9.0
durationpy==0.10
executing==2.2.0
fastjsonschema==2.21.1
filelock==3.18.0
filetype==1.2.0
flatbuffers==25.12.19
fonttools==4.59.1
fqdn==1.5.1
frozenlist==1.7.0
fsspec==2025.7.0
googleapis-common-protos==1.72.0
greenlet==3.2.4
griffe==1.11.0
grpcio==1.78.1
h11==0.16.0
hf-xet==1.1.7
httpcore==1.0.9
httptools==0.7.1
httpx==0.28.1
huggingface-hub==0.34.4
idna==3.10
importlib_metadata==8.7.1
importlib_resources==6.5.2
ipykernel==6.30.1
ipython==9.4.0
ipython_pygments_lexers==1.1.1
ipywidgets==8.1.7
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.6
jiter==0.13.0
joblib==1.5.1
json5==0.12.1
jsonpointer==3.0.0
jsonschema==4.25.0
jsonschema-specifications==2025.4.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.12.0
jupyter-lsp==2.2.6
jupyter_client==8.6.3
jupyter_core==5.8.1
jupyter_server==2.16.0
jupyter_server_terminals==0.5.3
jupyterlab==4.4.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.15
kiwisolver==1.4.9
kubernetes==35.0.0
lark==1.2.2
llama-index-core==0.14.14
llama-index-embeddings-huggingface==0.6.1
llama-index-instrumentation==0.4.0
llama-index-llms-ollama==0.9.1
llama-index-llms-openai==0.6.18
llama-index-readers-file==0.5.6
llama-index-retrievers-bm25==0.6.5
llama-index-vector-stores-chroma==0.5.5
llama-index-workflows==2.14.2
markdown-it-py==4.0.0
MarkupSafe==3.0.2
marshmallow==3.26.1
matplotlib==3.10.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.1.3
mmh3==5.2.0
mpmath==1.3.0
multidict==6.6.3
mypy_extensions==1.1.0
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.5
nltk==3.9.1
notebook==7.4.5
notebook_shim==0.2.4
numpy==2.3.2
oauthlib==3.3.1
ollama==0.5.3
onnxruntime==1.24.2
openai==2.21.0
opentelemetry-api==1.39.1
opentelemetry-exporter-otlp-proto-common==1.39.1
opentelemetry-exporter-otlp-proto-grpc==1.39.1
opentelemetry-proto==1.39.1
opentelemetry-sdk==1.39.1
opentelemetry-semantic-conventions==0.60b1
orjson==3.11.7
overrides==7.7.0
packaging==25.0
pandas==2.2.3
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pillow==11.3.0
platformdirs==4.3.8
posthog==5.4.0
prometheus_client==0.22.1
prompt_toolkit==3.0.51
propcache==0.3.2
protobuf==6.33.5
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pybase64==1.4.3
pycparser==2.22
pydantic==2.11.7
pydantic_core==2.33.2
Pygments==2.19.2
pyparsing==3.2.3
pypdf==6.7.1
PyPika==0.51.1
pyproject_hooks==1.2.0
PyStemmer==2.2.0.3
python-dateutil==2.9.0.post0
python-dotenv==1.2.1
python-json-logger==3.3.0
pytz==2025.2
PyYAML==6.0.2
pyzmq==27.0.1
referencing==0.36.2
regex==2025.7.34
requests==2.32.4
requests-oauthlib==2.0.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rfc3987-syntax==1.1.0
rich==14.3.3
rpds-py==0.27.0
safetensors==0.6.2
scikit-learn==1.7.1
scipy==1.16.1
seaborn==0.13.2
Send2Trash==1.8.3
sentence-transformers==5.1.0
setuptools==80.9.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
soupsieve==2.7
SQLAlchemy==2.0.42
stack-data==0.6.3
striprtf==0.0.26
sympy==1.14.0
tenacity==9.1.2
terminado==0.18.1
threadpoolctl==3.6.0
tiktoken==0.11.0
tinycss2==1.4.0
tokenizers==0.21.4
torch==2.8.0
tornado==6.5.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.55.0
typer==0.24.1
types-python-dateutil==2.9.0.20250809
typing-inspect==0.9.0
typing-inspection==0.4.1
typing_extensions==4.14.1
tzdata==2025.2
uri-template==1.3.0
urllib3==2.5.0
uvicorn==0.41.0
uvloop==0.22.1
watchfiles==1.1.1
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
websockets==16.0
widgetsnbextension==4.0.14
wrapt==1.17.2
yarl==1.20.1
zipp==3.23.0

retrieve.py Normal file

@ -0,0 +1,140 @@
# retrieve.py
# Hybrid verbatim chunk retrieval: BM25 + vector search + cross-encoder, no LLM.
#
# Same hybrid retrieval as query_hybrid.py but outputs raw chunk text
# instead of LLM synthesis. Useful for inspecting what the hybrid pipeline
# retrieves.
#
# Each chunk is annotated with its source (vector, BM25, or both) so you can
# see which retriever nominated it.
#
# E.M.F. February 2026
# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
from llama_index.core import (
StorageContext,
load_index_from_storage,
Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.retrievers.bm25 import BM25Retriever
import sys
import textwrap
#
# Globals
#
# Embedding model (must match build_store.py)
EMBED_MODEL = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)
# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15
# Retrieval parameters
VECTOR_TOP_K = 20
BM25_TOP_K = 20
# Output formatting
WRAP_WIDTH = 80
def main():
# No LLM needed -- set embed model only
Settings.embed_model = EMBED_MODEL
# Load persisted vector store
storage_context = StorageContext.from_defaults(persist_dir="./store")
index = load_index_from_storage(storage_context)
# --- Retrievers ---
vector_retriever = index.as_retriever(similarity_top_k=VECTOR_TOP_K)
bm25_retriever = BM25Retriever.from_defaults(
index=index,
similarity_top_k=BM25_TOP_K,
)
# Cross-encoder re-ranker
reranker = SentenceTransformerRerank(
model=RERANK_MODEL,
top_n=RERANK_TOP_N,
)
# Query
if len(sys.argv) < 2:
print("Usage: python retrieve.py QUERY_TEXT")
sys.exit(1)
q = " ".join(sys.argv[1:])
# Retrieve from both sources
vector_nodes = vector_retriever.retrieve(q)
bm25_nodes = bm25_retriever.retrieve(q)
# Track which retriever found each node
vector_ids = {n.node.node_id for n in vector_nodes}
bm25_ids = {n.node.node_id for n in bm25_nodes}
# Merge and deduplicate by node ID
seen_ids = set()
merged = []
for node in vector_nodes + bm25_nodes:
node_id = node.node.node_id
if node_id not in seen_ids:
seen_ids.add(node_id)
merged.append(node)
# Re-rank merged candidates
reranked = reranker.postprocess_nodes(merged, query_str=q)
# Retrieval stats
n_both = len(vector_ids & bm25_ids)
n_vector_only = len(vector_ids - bm25_ids)
n_bm25_only = len(bm25_ids - vector_ids)
print(f"\nQuery: {q}")
print(f"Vector: {len(vector_nodes)}, BM25: {len(bm25_nodes)}, "
f"overlap: {n_both}, merged: {len(merged)}, re-ranked to: {len(reranked)}")
print(f" vector-only: {n_vector_only}, bm25-only: {n_bm25_only}, both: {n_both}\n")
# Output re-ranked chunks with source annotation
for i, node in enumerate(reranked, 1):
meta = getattr(node, "metadata", None) or node.node.metadata
score = getattr(node, "score", None)
file_name = meta.get("file_name", "unknown")
text = node.get_content()
node_id = node.node.node_id
# Annotate source
in_vector = node_id in vector_ids
in_bm25 = node_id in bm25_ids
if in_vector and in_bm25:
source = "vector+bm25"
elif in_bm25:
source = "bm25-only"
else:
source = "vector-only"
print("=" * WRAP_WIDTH)
print(f"=== [{i}] {file_name} (score: {score:.3f}) [{source}]")
print("=" * WRAP_WIDTH)
for line in text.splitlines():
if line.strip():
print(textwrap.fill(line, width=WRAP_WIDTH))
else:
print()
print()
if __name__ == "__main__":
main()
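The source annotation in the output loop reduces to a small pure function, which makes the three-way labeling easy to check on its own (the name label_source is illustrative):

```python
def label_source(node_id, vector_ids, bm25_ids):
    # Same three-way annotation retrieve.py prints next to each chunk:
    # which retriever(s) nominated the candidate before re-ranking.
    if node_id in vector_ids and node_id in bm25_ids:
        return "vector+bm25"
    if node_id in bm25_ids:
        return "bm25-only"
    return "vector-only"
```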

run_query.sh Executable file

@ -0,0 +1,30 @@
#!/bin/bash
# Interactive wrapper for the python query engine: reads a query,
# runs it, and prints the formatted results
# E.M.F. August 2025
# Usage: ./run_query.sh
QUERY_SCRIPT="query_hybrid.py"
echo -e "Current query engine is $QUERY_SCRIPT\n"
# Loop until input is "exit"
while true; do
read -p "Enter your query (or type 'exit' to quit): " query
if [ -z "$query" ] || [ "$query" == "exit" ] || [ "$query" == "quit" ]; then
echo "Exiting..."
break
fi
time_start=$(date +%s)
# Call the python script with the query and format the output
python3 "$QUERY_SCRIPT" "$query" | \
expand | sed -E 's|(.* )(.*/data)|\1./data|' | fold -s -w 131
time_end=$(date +%s)
elapsed=$((time_end - time_start))
echo -e "Query processed in $elapsed seconds.\n"
echo "$query" >> query.log
done

search_keywords.py Normal file

@ -0,0 +1,189 @@
# search_keywords.py
# Keyword search: extract terms from a query using POS tagging, then grep
# across journal files for matches.
#
# Complements the vector search pipeline by catching exact names, places,
# and dates that embeddings can miss. No vector store or LLM needed.
#
# Term extraction uses NLTK POS tagging to keep nouns (NN*), proper nouns
# (NNP*), and adjectives (JJ*) -- skipping stopwords and function words
# automatically. Consecutive proper nouns are joined into multi-word phrases
# (e.g., "Robert Wright" stays as one search term, not "robert" + "wright").
#
# E.M.F. February 2026
import os
import sys
import re
from pathlib import Path
import nltk
#
# Globals
#
DATA_DIR = Path("./data")
CONTEXT_LINES = 2 # lines of context around each match
MAX_MATCHES_PER_FILE = 3 # cap matches shown per file to avoid flooding
# POS tags to keep: nouns, proper nouns, adjectives
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJS", "JJR"}
# Proper noun tags (consecutive runs are joined as phrases)
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
# Minimum word length to keep (filters out short noise)
MIN_WORD_LEN = 3
def ensure_nltk_data():
"""Download NLTK data if not already present."""
for resource, name in [
("tokenizers/punkt_tab", "punkt_tab"),
("taggers/averaged_perceptron_tagger_eng", "averaged_perceptron_tagger_eng"),
]:
try:
nltk.data.find(resource)
except LookupError:
print(f"Downloading NLTK resource: {name}")
nltk.download(name, quiet=True)
def extract_terms(query):
"""Extract key terms from a query using POS tagging.
Tokenizes the query, runs POS tagging, and keeps nouns, proper nouns,
and adjectives. Consecutive proper nouns (NNP/NNPS) are joined into
multi-word phrases (e.g., "Robert Wright" -> "robert wright").
Returns a list of terms (lowercase), phrases listed first.
"""
tokens = nltk.word_tokenize(query)
tagged = nltk.pos_tag(tokens)
phrases = [] # multi-word proper noun phrases
single_terms = [] # individual nouns/adjectives
proper_run = [] # accumulator for consecutive proper nouns
for word, tag in tagged:
if tag in PROPER_NOUN_TAGS:
proper_run.append(word)
else:
# Flush any accumulated proper noun run
if proper_run:
phrase = " ".join(proper_run).lower()
if len(phrase) >= MIN_WORD_LEN:
phrases.append(phrase)
proper_run = []
# Keep other nouns and adjectives as single terms
if tag in KEEP_TAGS and len(word) >= MIN_WORD_LEN:
single_terms.append(word.lower())
# Flush final proper noun run
if proper_run:
phrase = " ".join(proper_run).lower()
if len(phrase) >= MIN_WORD_LEN:
phrases.append(phrase)
# Phrases first (more specific), then single terms
all_terms = phrases + single_terms
return list(dict.fromkeys(all_terms)) # deduplicate, preserve order
def search_files(terms, data_dir, context_lines=CONTEXT_LINES):
"""Search all .txt files in data_dir for the given terms.
Returns a list of (file_path, match_count, matches) where matches is a
list of (line_number, context_block) tuples.
"""
if not terms:
return []
# Build a single regex pattern that matches any term (case-insensitive)
pattern = re.compile(
r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
re.IGNORECASE
)
results = []
txt_files = sorted(data_dir.glob("*.txt"))
for fpath in txt_files:
try:
lines = fpath.read_text(encoding="utf-8").splitlines()
except (OSError, UnicodeDecodeError):
continue
matches = []
match_count = 0
seen_lines = set() # avoid overlapping context blocks
for i, line in enumerate(lines):
if pattern.search(line):
match_count += 1
if i in seen_lines:
continue
# Extract context window
start = max(0, i - context_lines)
end = min(len(lines), i + context_lines + 1)
block = []
for j in range(start, end):
seen_lines.add(j)
marker = ">>>" if j == i else " "
block.append(f" {marker} {j+1:4d}: {lines[j]}")
matches.append((i + 1, "\n".join(block)))
if match_count > 0:
results.append((fpath, match_count, matches))
# Sort by match count (most matches first)
results.sort(key=lambda x: x[1], reverse=True)
return results
def main():
if len(sys.argv) < 2:
print("Usage: python search_keywords.py QUERY_TEXT")
sys.exit(1)
ensure_nltk_data()
q = " ".join(sys.argv[1:])
# Extract terms
terms = extract_terms(q)
if not terms:
print(f"Query: {q}")
print("No searchable terms extracted. Try a more specific query.")
sys.exit(0)
print(f"Query: {q}")
print(f"Extracted terms: {', '.join(terms)}\n")
# Search
results = search_files(terms, DATA_DIR)
if not results:
print("No matches found.")
sys.exit(0)
# Summary
total_matches = sum(r[1] for r in results)
print(f"Found {total_matches} matches across {len(results)} files\n")
# Detailed output
for fpath, match_count, matches in results:
print("="*60)
print(f"--- {fpath.name} ({match_count} matches) ---")
print("="*60)
for line_num, block in matches[:MAX_MATCHES_PER_FILE]:
print(block)
print()
if len(matches) > MAX_MATCHES_PER_FILE:
print(f" ... and {len(matches) - MAX_MATCHES_PER_FILE} more matches\n")
if __name__ == "__main__":
main()
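The word-boundary regex and context-window extraction in search_files can be exercised without any journal data. A small sketch with hypothetical terms (in search_keywords.py these would come from POS-tagging the query):

```python
import re

# Hypothetical extracted terms -- a proper-noun phrase plus a single noun.
terms = ["robert wright", "meditation"]
pattern = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
    re.IGNORECASE,
)

lines = [
    "Woke early and sat for twenty minutes.",
    "Finished the Robert Wright book on meditation.",
    "Long walk afterward.",
]
hits = [i for i, line in enumerate(lines) if pattern.search(line)]

# Context window around the first match, as search_files builds it
# (CONTEXT_LINES = 2 on each side).
i = hits[0]
start, end = max(0, i - 2), min(len(lines), i + 2 + 1)
```

The `\b` anchors keep "meditation" from matching inside "premeditation", and `re.escape` makes multi-word phrases safe to embed in the alternation.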