Eric Furst eb9997326f Shell script run_retrieve.sh for non-LLM
generation queries (returns only chunks), track
development notes and README.
2026-03-01 07:39:28 -05:00


ssearch development log

Active files (after Feb 27 reorganization)

  • build_store.py — build/update journal vector store (incremental)
  • query_hybrid.py — hybrid BM25 + vector query with LLM synthesis
  • retrieve.py — hybrid verbatim chunk retrieval (no LLM)
  • search_keywords.py — keyword search via POS-based term extraction
  • run_query.sh — shell wrapper for interactive querying
  • clippings_search/build_clippings.py — build/update clippings vector store (ChromaDB)
  • clippings_search/retrieve_clippings.py — verbatim clippings retrieval
  • deploy_public.sh — deploy public files to Forgejo

Earlier scripts moved to archived/: build.py, build_exp.py, query_topk.py, query_catalog.py, query_exp.py, query_topk_prompt.py, query_topk_prompt_engine.py, query_topk_prompt_dw.py, query_rewrite_hyde.py, query_multitool.py, shared/build.py, shared/query.py, vs_metrics.py, claude_diagnostic.py, query_claude_sonnet.py, query_tree.py, query_topk_prompt_engine_v3.py, retrieve_raw.py

Best configuration

  • Embedding: BAAI/bge-large-en-v1.5, 256 token chunks, 25 token overlap
  • Re-ranker: cross-encoder/ms-marco-MiniLM-L-12-v2 (retrieve top-30, re-rank to top-15)
  • LLM: command-r7b via Ollama (temperature 0.3). OpenAI gpt-4o-mini available as alternative.
  • Retrieval: hybrid BM25 + vector, cross-encoder re-ranked

To do

  1. [DONE] Test v3 (cross-encoder re-ranking) and compare results with v2. Selected ms-marco-MiniLM-L-12-v2 after testing three models.

  2. [DONE] Verbatim retrieval mode (retrieve_raw.py). Uses index.as_retriever() instead of index.as_query_engine() to get chunks without LLM synthesis. Re-ranks with the same cross-encoder, then outputs raw chunk text with metadata and scores.

  3. [DONE] Keyword search pipeline (search_keywords.py). Extracts nouns and adjectives via NLTK POS tagging, then greps data files. Complements vector search for exact names, places, dates.

  4. [DONE] BM25 hybrid retrieval (sparse + dense). Two scripts: query_hybrid.py (with LLM synthesis) and retrieve.py (verbatim chunks, no LLM). Both run BM25 (top-20) and vector (top-20) retrievers, merge/deduplicate, then cross-encoder re-rank to top-15. Uses llama-index-retrievers-bm25.

  5. Explore query expansion (multiple phrasings, merged retrieval)

  6. Explore different vector store strategies (database)

  7. [DONE] Test ChatGPT API for final LLM generation (instead of local Ollama)

  8. [DONE] Remove API key from this file. Moved to ~/.bashrc as OPENAI_API_KEY.

    The retrieval pipeline (embedding, vector search, cross-encoder re-ranking) stays the same. Only the final synthesis LLM changes.

    Steps:

    1. Install the LlamaIndex OpenAI integration:
      pip install llama-index-llms-openai
      
    2. Set API key as environment variable:
      export OPENAI_API_KEY="sk-..."
      
      (Or store in a .env file and load with python-dotenv. Do NOT commit the key to version control.)
    3. In the query script, replace the Ollama LLM with OpenAI:
      # Current (local):
      from llama_index.llms.ollama import Ollama
      Settings.llm = Ollama(
          model="command-r7b",
          request_timeout=360.0,
          context_window=8000,
      )
      
      # New (API):
      from llama_index.llms.openai import OpenAI
      Settings.llm = OpenAI(
          model="gpt-4o-mini",   # or "gpt-4o" for higher quality
          temperature=0.1,
      )
      
    4. Run the query script as usual. Everything else (embedding model, vector store, cross-encoder re-ranker, prompt) is unchanged.
    5. Compare output quality and response time against command-r7b.

    Models to try: gpt-4o-mini (cheap, fast), gpt-4o (better quality). The prompt should work without modification since it's model-agnostic — just context + instructions.

    Note: This adds an external API dependency and per-query cost. The embedding and re-ranking remain fully local/offline.

    API KEY: moved to ~/.bashrc as OPENAI_API_KEY (do not store in repo)

    Getting an OpenAI API key:

    1. Go to https://platform.openai.com/ and sign up (or log in).
    2. Navigate to API keys: Settings > API keys (or https://platform.openai.com/api-keys).
    3. Click "Create new secret key", give it a name, and copy it. The key starts with sk- and is shown only once.
    4. Add billing: Settings > Billing. Load a small amount ($5-10) to start. API calls are pay-per-use, not a subscription.
    5. Set the key in your shell before running a query:
      export OPENAI_API_KEY="sk-..."
      
      Or add to ~/.zshrc (or ~/.bashrc) to persist across sessions. Do NOT commit the key to version control or put it in scripts.

    Approximate cost per query (Feb 2026):

    • gpt-4o-mini: ~$0.001-0.003 (15 chunks of context)
    • gpt-4o: ~$0.01-0.03

February 27, 2026

Project reorganization

Reorganized the project structure with Claude Code. Goals: drop legacy version numbers from filenames, archive superseded scripts, group clippings search into a subdirectory, and clean up storage directory names.

Script renames:

  • build_exp_claude.py → build_store.py
  • query_hybrid_bm25_v4.py → query_hybrid.py
  • retrieve_hybrid_raw.py → retrieve.py

Archived (moved to archived/):

  • query_topk_prompt_engine_v3.py — superseded by hybrid BM25+vector query
  • retrieve_raw.py — superseded by hybrid retrieval

Clippings search subdirectory:

  • build_clippings.py → clippings_search/build_clippings.py
  • retrieve_clippings.py → clippings_search/retrieve_clippings.py
  • Scripts use ./ paths relative to project root, so no path changes needed when run as python clippings_search/build_clippings.py from root.

Storage renames:

  • storage_exp/ → store/ (journal vector store)
  • storage_clippings/ → clippings_search/store_clippings/ (clippings vector store)
  • Deleted unused storage/ (original August 2025 store, never updated)

Updated references in run_query.sh, .gitignore, CLAUDE.md, README.md, and all Python scripts that referenced old storage paths.

Deploy script (deploy_public.sh)

Created deploy_public.sh to automate publishing to Forgejo. Previously, maintaining the public branch required manually recreating an orphan branch, copying files, editing the README, and force-pushing — error-prone and tedious.

The script:

  1. Checks that we're on main with no uncommitted changes
  2. Deletes the local public branch and creates a fresh orphan
  3. Copies listed public files from main (via git checkout main -- <file>)
  4. Generates a public README by stripping private sections (Notebooks, Development history) and private file references using awk
  5. Stages only the listed files (not untracked files on disk)
  6. Commits with a message and force-pushes to origin/public
  7. Switches back to main

Fixed a bug where git add . picked up untracked files (output_test.txt, run_retrieve.sh). Changed to git add "${PUBLIC_FILES[@]}" README.md.

Forgejo setup

Set up SSH push to Forgejo instance. Required adding SSH public key to Forgejo user settings. The remote uses a Tailscale address.

MIT License

Added MIT License (Copyright (c) 2026 E. M. Furst) to both main and public branches.

Devlog migration

Migrated devlog.txt to devlog.md with markdown formatting.


February 20, 2026

Offline use: environment variables must be set before imports

Despite setting HF_HUB_OFFLINE=1 and SENTENCE_TRANSFORMERS_HOME=./models (added Feb 16), the scripts still failed offline with a ConnectionError trying to reach huggingface.co. The error came from AutoTokenizer.from_pretrained() calling list_repo_templates(), which makes an HTTP request to the HuggingFace API.

Root cause: the huggingface_hub library evaluates HF_HUB_OFFLINE at import time, not at call time. The constant is set once in huggingface_hub/constants.py:

HF_HUB_OFFLINE = _is_true(os.environ.get("HF_HUB_OFFLINE")
                           or os.environ.get("TRANSFORMERS_OFFLINE"))

In all four scripts, the os.environ lines came AFTER the imports:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding  # triggers import of huggingface_hub
from llama_index.core.postprocessor import SentenceTransformerRerank
import os

os.environ["HF_HUB_OFFLINE"] = "1"   # too late, constant already False

By the time os.environ was set, huggingface_hub had already imported and locked the constant to False. The env var existed in the process environment but the library never re-read it.

Fix: moved import os and all three os.environ calls to the top of each file, before any llama_index or huggingface imports:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import ...          # now these see the env vars
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Updated scripts: query_topk_prompt_engine_v3.py, retrieve_raw.py, query_hybrid_bm25_v4.py, retrieve_hybrid_raw.py.

General lesson for offline HuggingFace use:

The HuggingFace ecosystem has multiple libraries that check for offline mode:

  • huggingface_hub: reads HF_HUB_OFFLINE (or TRANSFORMERS_OFFLINE) at import
  • transformers: delegates to huggingface_hub's constant
  • sentence-transformers: delegates to huggingface_hub's constant

All of them evaluate the flag ONCE at module load time. This means:

  1. os.environ must be set before ANY import that touches huggingface_hub
  2. Setting the env var in a "Globals" section after imports does NOT work
  3. Even indirect imports count — llama_index.embeddings.huggingface transitively imports huggingface_hub, so the flag must precede it
  4. Alternatively, set the env var in the shell before running Python:
    export HF_HUB_OFFLINE=1
    
    This always works because it's set before any Python code runs.
  5. The newer transformers library (v4.50+) added list_repo_templates() in AutoTokenizer.from_pretrained(), which makes network calls that weren't present in earlier versions. This is why the Feb 16 fix worked initially (or appeared to) but broke after a package update.

This is a common pitfall for anyone running HuggingFace models offline (e.g., on a laptop without network, air-gapped environments, or behind restrictive firewalls). The models are cached locally and work fine — but the library still tries to check for updates unless the offline flag is set correctly.


Incremental vector store updates

Added incremental update mode to build_store.py (then still named build_exp_claude.py). Previously the script rebuilt the entire vector store from scratch every run (~1848 files). Now it defaults to incremental mode: loads the existing index, compares against ./data, and only processes new, modified, or deleted files.

Usage:

python build_store.py            # incremental update (default)
python build_store.py --rebuild  # full rebuild from scratch

How it works:

  • The LlamaIndex docstore (store/docstore.json) already tracks every indexed document with metadata: file_name, file_size, last_modified_date.
  • The script scans ./data/*.txt and classifies each file:
    • New: file_name not in docstore → insert
    • Modified: file_size or last_modified_date differs → delete + re-insert
    • Deleted: in docstore but not on disk → delete
    • Unchanged: skip
  • Uses index.insert() and index.delete_ref_doc() from the LlamaIndex API.
  • The same SentenceSplitter (256 tokens, 25 overlap) is applied via Settings.transformations so chunks match the original build.
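The classification step can be sketched in plain Python (the function name and the dict shapes are illustrative; the real script reads file_name, file_size, and last_modified_date out of the LlamaIndex docstore):

```python
def classify_files(docstore: dict, on_disk: dict):
    """Partition files into new / modified / deleted / unchanged.

    docstore: {file_name: (file_size, last_modified_date)} from docstore.json
    on_disk:  {file_name: (file_size, last_modified_date)} from scanning ./data
    """
    new = [f for f in on_disk if f not in docstore]
    deleted = [f for f in docstore if f not in on_disk]
    modified = [f for f in on_disk
                if f in docstore and docstore[f] != on_disk[f]]
    unchanged = [f for f in on_disk
                 if f in docstore and docstore[f] == on_disk[f]]
    return new, modified, deleted, unchanged
```

New files go to index.insert(), modified to delete + re-insert, deleted to index.delete_ref_doc().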

Timing: incremental update with nothing to do takes ~17s (loading the index). Full rebuild takes several minutes. First incremental run after a stale index found 8 new files and 204 modified files, completed in ~65s.

Important detail: SimpleDirectoryReader converts file timestamps to UTC (datetime.fromtimestamp(mtime, tz=timezone.utc)) before formatting as YYYY-MM-DD. The comparison logic must use UTC too, or files modified late in the day will show as "modified" due to the date rolling forward in UTC. This caused a false-positive bug on the first attempt.
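A minimal illustration of the UTC-consistent formatting (helper name is mine, but the fromtimestamp call matches what SimpleDirectoryReader does):

```python
from datetime import datetime, timezone

def utc_date(mtime: float) -> str:
    """Format a file mtime the way SimpleDirectoryReader does: UTC, YYYY-MM-DD."""
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")

# A file saved at 23:30 local time (UTC-5) already belongs to the *next* day
# in UTC, so comparing against a locally formatted date would wrongly flag
# the file as "modified".
```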

This enables running the build as a cron job to keep the vector store current as new journal entries are added.


February 18, 2026

LLM comparison: gpt-4o-mini (OpenAI API) vs command-r7b (local Ollama)

Test query: "Passages that quote Louis Menand" (hybrid BM25+vector, v4). Retrieval was identical (same 15 chunks, same scores); only synthesis differs. Results saved in tests/results_openai.txt and tests/results_commandr7b.txt.

gpt-4o-mini:

  • Cited 6 files (2025-11-04, 2025-02-14, 2022-08-14, 2025-07-27, 2025-02-05, 2024-09-04). Drew from chunks ranked as low as #14.
  • Better at distinguishing direct quotes from paraphrases and indirect references. Provided a structured summary with numbered entries.
  • 44 seconds total (most of that is local retrieval/re-ranking; the API call itself is nearly instant).

command-r7b:

  • Cited 2 files (2025-11-04, 2022-08-14). Focused on the top-scored chunks and ignored lower-ranked ones.
  • Pulled out actual quotes verbatim as block quotes — more useful if you want the exact text rather than a summary.
  • 78 seconds total.

Summary: gpt-4o-mini is broader (more sources, better use of the full context window) and nearly 2x faster. command-r7b is more focused and reproduces exact quotes. Both correctly identified the core passages. The quality difference is noticeable but not dramatic — the retrieval pipeline does most of the heavy lifting.

Temperature experiments

The gpt-4o-mini test used temperature=0.1 (nearly deterministic). command-r7b via Ollama defaults to temperature=0.8 — so the two models were tested at very different temperatures, which may account for some of the stylistic difference.

Temperature guidance for RAG synthesis:

| Range | Behavior | Use case |
|-------|----------|----------|
| 0.0–0.1 | Nearly deterministic. Picks highest-probability tokens. | Factual extraction, consistency. Can "tunnel vision." |
| 0.3–0.5 | Moderate. More varied phrasing, draws connections across chunks. | Good middle ground for RAG (prompt already constrains context). |
| 0.7–1.0 | Creative/varied. Riskier for RAG — may paraphrase loosely. | Not ideal for faithfulness to source text. |

Follow-up: temperature=0.3 for both models (same query, same retrieval)

command-r7b at 0.3 (was 0.8): Major improvement. Cited 6 files (was 2). Drew from lower-ranked chunks including #15. Used the full context window instead of fixating on top hits. Took 94s (was 78s) due to more output.

gpt-4o-mini at 0.3 (was 0.1): Nearly identical to 0.1 run. Same 6 files, same structure. Slightly more interpretive phrasing but no meaningful change. This model is less sensitive to temperature for RAG synthesis.

Key finding: Temperature is a critical but often overlooked parameter when evaluating the generation stage of a RAG pipeline. In our tests, a local 7B model (command-r7b) went from citing 2 sources to 6 — a 3x improvement in context utilization — simply by lowering temperature from 0.8 to 0.3. At the higher temperature, the model "wandered" during generation, focusing on the most salient chunks and producing repetitive output. At the lower temperature, it methodically worked through the full context window.

Implications for RAG evaluation methodology:

  1. When comparing LLMs for RAG synthesis, temperature must be controlled across models. Our initial comparison (gpt-4o-mini at 0.1 vs command-r7b at 0.8 default) overstated the quality gap between models.
  2. The "right" temperature for RAG is lower than for open-ended generation. The prompt and retrieved context already constrain the task; high temperature adds noise rather than creativity.
  3. Temperature affects context utilization, not just style. A model that appears to "ignore" lower-ranked chunks may simply need a lower temperature to attend to them.
  4. At temperature=0.3, a local 7B model and a cloud API model converged on similar quality (6 files cited, good coverage, mix of quotes and paraphrase). The retrieval pipeline does most of the heavy lifting; the generation model's job is to faithfully synthesize what was retrieved.

Testing method: Hold retrieval constant (same query, same vector store, same re-ranker, same top-15 chunks). Vary only the LLM and temperature. Compare on: number of source files cited, whether lower-ranked chunks are used, faithfulness to source text, and total query time. Results saved in tests/ with naming convention results_<model>_t<temp>.txt.
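A sketch of that comparison harness (run_llm is a stand-in for the synthesis call; only the output naming convention is from the notes above):

```python
from pathlib import Path

def run_matrix(query, configs, run_llm, out_dir="tests"):
    """Hold retrieval constant and vary (model, temperature); save each
    result under the naming convention results_<model>_t<temp>.txt.

    run_llm(query, model, temp) -> str is a stand-in for the full
    retrieve + synthesize call with the LLM swapped in.
    """
    Path(out_dir).mkdir(exist_ok=True)
    for model, temp in configs:
        text = run_llm(query, model, temp)
        Path(out_dir, f"results_{model}_t{temp}.txt").write_text(text)
```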


LlamaIndex upgrade to 0.14.14

Upgraded LlamaIndex from 0.13.1 to 0.14.14 to add OpenAI API support.

Installing llama-index-llms-openai pulled in llama-index-core 0.14.14, which was incompatible with the existing companion packages (all pinned to <0.14). Fixed by upgrading all companion packages together:

pip install --upgrade llama-index-embeddings-huggingface \
    llama-index-readers-file llama-index-llms-ollama \
    llama-index-retrievers-bm25

Final package versions:

| Package | Version | Was |
|---------|---------|-----|
| llama-index-core | 0.14.14 | 0.13.1 |
| llama-index-embeddings-huggingface | 0.6.1 | 0.6.0 |
| llama-index-llms-ollama | 0.9.1 | 0.7.0 |
| llama-index-llms-openai | 0.6.18 | new |
| llama-index-readers-file | 0.5.6 | 0.5.0 |
| llama-index-retrievers-bm25 | 0.6.5 | unchanged |
| llama-index-workflows | 2.14.2 | 1.3.0 |

Smoke test: retrieve_raw.py "mining towns" — works, same results as before. No vector store rebuild needed. The existing store loaded fine with 0.14.


Paragraph separator validation

Checked whether paragraph_separator="\n\n" in build_store.py makes sense for the journal data.

Results from scanning all 1,846 files in ./data/:

  • 1,796 files (97%) use \n\n as paragraph boundaries
  • 28 files use single newlines only
  • 22 files have no newlines at all
  • Average paragraphs per file: 10.8 (median 7, range 0–206)
  • 900 files (49%) also use --- as a topic/section separator
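A scan like the one above can be reproduced with a few lines of stdlib Python (this is a sketch, not the script that was actually run):

```python
from pathlib import Path

def classify_separators(text: str) -> str:
    """Classify a file's paragraph style: 'double' (\\n\\n), 'single', or 'none'."""
    if "\n\n" in text:
        return "double"
    if "\n" in text.strip():
        return "single"
    return "none"

def scan(data_dir: str = "./data") -> dict:
    """Count paragraph styles across all journal files."""
    counts = {"double": 0, "single": 0, "none": 0}
    for path in Path(data_dir).glob("*.txt"):
        counts[classify_separators(path.read_text())] += 1
    return counts
```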

The \n\n setting is correct. SentenceSplitter tries to break at paragraph_separator boundaries first, then falls back to sentence boundaries, then words. With 256-token chunks, this keeps semantically related sentences together within a paragraph.

The --- separators are already surrounded by \n\n (e.g., \n\n---\n\n), so they naturally act as break points too. No special handling needed.

Note: "\n\n" is actually the default value for paragraph_separator in LlamaIndex's SentenceSplitter. The explicit setting documents intent but is functionally redundant.

List-style entries with single newlines between items (e.g., 2001-09-14.txt) stay together within a chunk, which is desirable — lists shouldn't be split line by line.


February 16, 2026

Cross-encoder model caching for offline use

Cached the cross-encoder model (cross-encoder/ms-marco-MiniLM-L-12-v2) in ./models/ for offline use. Previously, HuggingFaceEmbedding already used cache_folder="./models" with local_files_only=True for the embedding model, but the cross-encoder (loaded via SentenceTransformerRerank → CrossEncoder) had no cache_folder parameter and would fail offline when it tried to phone home for updates.

Fix: all scripts that use the cross-encoder now set two environment variables before model initialization:

os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

SENTENCE_TRANSFORMERS_HOME directs the CrossEncoder to look in ./models/ for cached weights. HF_HUB_OFFLINE prevents any network access attempt.

The model was cached using huggingface_hub.snapshot_download():

from huggingface_hub import snapshot_download
snapshot_download('cross-encoder/ms-marco-MiniLM-L-12-v2', cache_dir='./models')

Models now in ./models/:

  • models--BAAI--bge-large-en-v1.5 (embedding, bi-encoder)
  • models--cross-encoder--ms-marco-MiniLM-L-12-v2 (re-ranker, cross-encoder)
  • models--sentence-transformers--all-mpnet-base-v2 (old embedding, kept)

February 15, 2026

Design note on search_keywords.py

The POS tagger has a fundamental limitation: it was trained on declarative prose, not imperative queries. A query like "Find passages that mention Louis Menand" causes the tagger to classify "find" and "mention" as nouns (NN) rather than verbs, because the imperative sentence structure is unusual in its training data. This floods results with false positives (304 matches across 218 files instead of the handful mentioning Menand).

More fundamentally: for term-based searches, the POS tagging layer adds minimal value over bare grep. If the input is "Louis Menand", POS tagging extracts "louis menand" — identical to what grep would match. The tool's real value is not the NLP layer but the convenience wrapper: searching all files at once, joining multi-word proper nouns, sorting by match count, and showing context around matches. It's essentially a formatted multi-file grep.
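In sketch form, that "formatted multi-file grep" is roughly this (function and defaults are illustrative, not the actual search_keywords.py code):

```python
import re
from pathlib import Path

def grep_files(terms, data_dir="./data", context=40):
    """Count case-insensitive matches per file, sort files by match count,
    and show a snippet of context around the first hit in each file."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = []
    for path in sorted(Path(data_dir).glob("*.txt")):
        text = path.read_text()
        matches = list(pattern.finditer(text))
        if matches:
            m = matches[0]
            snippet = text[max(0, m.start() - context):m.end() + context]
            hits.append((path.name, len(matches), snippet))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```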

Possible future direction: merge keyword search results with semantic search results. The keyword pipeline catches exact names, places, and dates that embeddings miss, while the semantic pipeline catches thematic relevance that keywords miss. A hybrid approach could combine both result sets, using keyword matches to boost or supplement vector retrieval. This connects to the BM25 hybrid retrieval idea (to-do item 4).

New scripts: query_hybrid_bm25_v4.py and retrieve_hybrid_raw.py

Implemented BM25 hybrid retrieval (to-do item 4). Both scripts run two retrievers in parallel on the same query:

  • Vector retriever: top-20 by cosine similarity (semantic meaning)
  • BM25 retriever: top-20 by term frequency (exact lexical matching)

Results are merged and deduplicated by node ID, then passed to the cross-encoder re-ranker (ms-marco-MiniLM-L-12-v2) → top-15.
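The merge/dedupe step, including the source tags that retrieve_hybrid_raw.py prints, looks roughly like this (a structural sketch over (node_id, score) pairs; the real code works with LlamaIndex NodeWithScore objects):

```python
def merge_by_id(vector_nodes, bm25_nodes):
    """Merge two retriever result lists, dedupe by node ID, tag the source.

    Each input is a list of (node_id, score) pairs. The pre-merge score is
    kept only for display: the cross-encoder re-scores every candidate, so
    BM25 and cosine scores never need to be made comparable.
    """
    merged = {}
    for node_id, score in vector_nodes:
        merged[node_id] = {"score": score, "source": "vector-only"}
    for node_id, score in bm25_nodes:
        if node_id in merged:
            merged[node_id]["source"] = "vector+bm25"
        else:
            merged[node_id] = {"score": score, "source": "bm25-only"}
    return merged  # candidates then go to the cross-encoder re-ranker
```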

query_hybrid_bm25_v4.py feeds the re-ranked chunks to the LLM (same v3 prompt and command-r7b model). retrieve_hybrid_raw.py outputs the raw chunks with source annotations: [vector-only], [bm25-only], or [vector+bm25], showing which retriever nominated each result.

The BM25 retriever uses BM25Retriever.from_defaults(index=index) from llama-index-retrievers-bm25 (v0.6.5). It indexes the nodes already stored in the persisted vector store — no separate build step needed.

Key idea: BM25's job is only to nominate candidates that vector similarity might miss (exact names, dates, specific terms). The cross-encoder decides final relevance regardless of where candidates came from.


February 12, 2026

Updated vector store, now 4816 chunks.

Scope of a language model based search: LLMs can summarize, but lack the ability to critically read and compare information. ChatGPT can summarize the literature that I've cited, but it cannot critique it. (It could generate from published critiques.) Our ability to critically read and synthesize from literature is an important skill. (Most reviews fall far short, simply aggregating "advances" without asking why, how, or whether they are real or not.)


February 11, 2026

Project tidy-up and cross-encoder re-ranking (v3)

Tidied up the project with Claude Code:

  • Generated README.md and CLAUDE.md documentation
  • Archived superseded scripts (v1 query engines, old build scripts, shared/, experimental/query_multitool.py)
  • Removed stale storage_exp copy (Aug 2025 backup, ~105 MB)
  • Removed empty shared/ and experimental/ directories

Created query_topk_prompt_engine_v3.py: adds cross-encoder re-ranking.

The idea: the current pipeline (v2) uses a bi-encoder (BAAI/bge-large-en-v1.5) that encodes query and chunks independently, then compares via cosine similarity. This is fast but approximate — the query and chunk never "see" each other.

A cross-encoder takes the query and chunk as a single concatenated input, with full attention between all tokens. It scores the pair jointly, which captures nuance that dot-product similarity misses (paraphrase, negation, indirect relevance). The tradeoff is speed: you can't pre-compute scores.

v3 uses a two-stage approach:

  1. Retrieve top-30 via bi-encoder (fast, approximate)
  2. Re-rank to top-15 with cross-encoder (slow, precise)
  3. Pass re-ranked chunks to LLM for synthesis

Cross-encoder model: cross-encoder/ms-marco-MiniLM-L-6-v2 (~80 MB, 6 layers). Trained on MS MARCO passage ranking. Should add only a few seconds to query time for 30 candidates.

Bi-encoder vs cross-encoder

Bi-encoder (what the pipeline had): The embedding model (BAAI/bge-large-en-v1.5) encodes the query and each chunk independently into vectors. Similarity is a dot product between two vectors that were computed separately. This is fast — you can pre-compute all chunk vectors once at build time and just compare against the query vector at search time. But because query and chunk never "see" each other during encoding, the model can miss subtle relevance signals.

Cross-encoder (what v3 adds): A cross-encoder takes the query and a chunk as a single input pair: [query, chunk] concatenated together. It reads both simultaneously through the transformer, with full attention between every token in the query and every token in the chunk. It outputs a single relevance score. This is much more accurate because the model can reason about the specific relationship between your question and the passage — word overlap, paraphrase, negation, context.

The tradeoff: it's slow. You can't pre-compute anything because the score depends on the specific query. Scoring 4,692 chunks this way would take too long.

Why the two-stage approach works:

4,692 chunks  →  bi-encoder (fast, approximate)  →  top 30
    top 30    →  cross-encoder (slow, precise)    →  top 15
    top 15    →  LLM synthesis                    →  response

Concrete example: If you search "times the author felt conflicted about career choices," the bi-encoder might rank a chunk about "job satisfaction" highly because the vectors are close. But a chunk that says "I couldn't decide whether to stay or leave" — without using the word "career" — might score lower in vector space. The cross-encoder, reading both query and chunk together, would recognize that "couldn't decide whether to stay or leave" is highly relevant to "felt conflicted about career choices."
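The two-stage structure, independent of any particular model, is just this (fast_score and precise_score are stand-ins for the bi-encoder dot product and the cross-encoder forward pass):

```python
def two_stage_search(query, chunks, fast_score, precise_score,
                     first_k=30, final_k=15):
    """Two-stage retrieval: cheap approximate scoring over everything,
    expensive joint scoring over the survivors only."""
    # Stage 1: approximate ranking over the full collection (pre-computable)
    candidates = sorted(chunks, key=lambda c: fast_score(query, c),
                        reverse=True)[:first_k]
    # Stage 2: joint query+chunk scoring, but only over first_k candidates
    return sorted(candidates, key=lambda c: precise_score(query, c),
                  reverse=True)[:final_k]
```

The expensive scorer runs first_k times per query instead of once per chunk in the store, which is what makes cross-encoder re-ranking affordable.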

Prompt update for v3

Updated the v3 prompt to account for re-ranked context. Changes:

  • Tells the LLM the context is from a "personal journal collection" and has been "selected and ranked for relevance"
  • "Examine ALL provided excerpts, not just the top few" — counters single-file collapse seen in initial testing
  • "When multiple files touch on the query, note what each one contributes" — encourages breadth across sources
  • "End with a list of all files that contributed" — stronger than v2's vague "list all relevant source files"

Also updated run_query.sh to point to v3.

v3 test results

Query: "Passages that describe mining towns."

  • Response cited 2 passages from 2023-03-15.txt (coal mining, great-grandfather)
  • Source documents included 7 distinct files across 15 chunks
  • Top cross-encoder score: -1.177 (2025-09-14.txt)
  • LLM focused on 2023-03-15.txt which had the most explicit mining content
  • Query time: 76 seconds
  • Note: cross-encoder scores are raw logits (negative), not 0–1 cosine similarity

Query: "I am looking for entries that discuss memes and cognition."

  • Response cited 6 distinct files with specific content from each: 2025-07-14 (Dennett/Blackmore on memes), 2023-09-20 (Hurley model), 2024-03-24 (multiple drafts model), 2021-04-25 (consciousness discussion), 2026-01-08 (epistemological frameworks), 2025-03-10 (Extended Mind Theory)
  • Top cross-encoder score: 4.499 (2026-01-08.txt) — clear separation from rest
  • LLM drew from chunks ranked 3rd, 4th, 5th, 12th, and 15th — confirming it examines the full context, not just top hits
  • Query time: 71 seconds

Observations:

  • The v3 prompt produces much better multi-source synthesis than v2's prompt
  • Cross-encoder scores show clear separation between strong and weak matches
  • The re-ranker + new prompt together encourage breadth across files
  • Query time comparable to v2 (~70–80 seconds)

Cross-encoder model comparison

Tested three cross-encoder models on the same query ("Discussions of Kondiaronk and the Wendats") to compare re-ranking behavior.

1. cross-encoder/ms-marco-MiniLM-L-12-v2 (baseline)

  • Scores: raw logits, wide spread (top score 3.702)
  • Clear separation between strong and weak matches
  • Balanced ranking: 2025-06-07.txt #1, 2025-07-28.txt #2, 2024-12-25.txt #3
  • Query time: ~70–80 seconds
  • Trained on MS MARCO passage ranking (query → relevant passage)

2. cross-encoder/stsb-roberta-base

  • Scores: 0.308 to 0.507 — very compressed range (0.199 spread)
  • Poor differentiation: model can't clearly separate relevant from irrelevant
  • Pulled in 2019-07-03.txt at #2 (not in L-12 results), dropped 2024-12-25.txt
  • Query time: 92 seconds
  • Trained on STS Benchmark (semantic similarity, not passage ranking) — wrong task for re-ranking. Measures "are these texts about the same thing?" rather than "is this passage a good answer to this query?"

3. BAAI/bge-reranker-v2-m3

  • Scores: calibrated probabilities (0–1). Sharp top (0.812), then 0.313, 0.262… Bottom 6 chunks at 0.001 (model says: not relevant at all)
  • Very confident about #1 (2025-07-28.txt at 0.812), but long zero tail
  • 5 of 15 chunks from 2025-07-28.txt — heavy concentration on one file
  • Query time: 125 seconds (50% slower than L-12)
  • Multilingual model, larger than ms-marco MiniLM variants

Summary:

| Model | Score spread | Speed | Differentiation |
|-------|--------------|-------|-----------------|
| ms-marco-MiniLM-L-12-v2 | Wide (logits) | ~70–80s | Good, balanced |
| BAAI/bge-reranker-v2-m3 | Sharp top/zeros | ~125s | Confident #1, weak tail |
| stsb-roberta-base | Compressed | ~92s | Poor |

Decision: ms-marco-MiniLM-L-12-v2 is the best fit. Purpose-built for passage ranking, fastest of the three, and produces balanced rankings with good score separation. The BAAI model's zero-tail problem means 6 of 15 chunks are dead weight in the context window (could be mitigated by lowering RERANK_TOP_N or adding a score cutoff, but adds complexity for marginal gain). The stsb model is simply wrong for this task — semantic similarity ≠ passage relevance.
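For reference, the score-cutoff mitigation would be a one-liner like this (threshold is hypothetical, and only meaningful for calibrated 0–1 scores like bge-reranker's, not raw logits):

```python
def filter_reranked(nodes, min_score=0.05, max_n=15):
    """Drop re-ranked chunks below a score cutoff so near-zero-scored
    chunks don't waste context window. Inputs are (node_id, score) pairs."""
    kept = [n for n in nodes if n[1] >= min_score]
    return kept[:max_n]
```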

New scripts: retrieve_raw.py and search_keywords.py

retrieve_raw.py — Verbatim chunk retrieval, no LLM. Uses the LlamaIndex retriever API instead of the query engine:

# v3 uses as_query_engine() — full pipeline including LLM synthesis
query_engine = index.as_query_engine(
    similarity_top_k=30,
    text_qa_template=PROMPT,
    node_postprocessors=[reranker],
)
response = query_engine.query(q)   # returns LLM-generated text

# retrieve_raw.py uses as_retriever() — stops after retrieval
retriever = index.as_retriever(similarity_top_k=30)
nodes = retriever.retrieve(q)       # returns raw NodeWithScore objects
reranked = reranker.postprocess_nodes(nodes, query_str=q)

The key distinction: as_query_engine() wraps retrieval + synthesis into one call (retriever → node postprocessors → response synthesizer → LLM). as_retriever() returns just the retriever component, giving back the raw nodes with their text and metadata. The re-ranker's postprocess_nodes() method can still be called manually on the retrieved nodes.

Each node has:

  • node.get_content() — the chunk text
  • node.metadata — dict with file_name, file_path, etc.
  • node.score — similarity or re-ranker score

This separation is useful for inspecting what the pipeline retrieves before the LLM processes it, and for building alternative output formats.

search_keywords.py — Keyword search via NLTK POS tagging. Completely separate from the vector store pipeline. Extracts nouns (NN, NNS, NNP, NNPS) and adjectives (JJ, JJR, JJS) from the query using nltk.pos_tag(), then searches ./data/*.txt with regex. Catches exact terms that embeddings miss. NLTK data (punkt_tab, averaged_perceptron_tagger_eng) is auto-downloaded on first run.


January 12, 2026

Best practices for query rewriting

  1. Understand the original intent: Clarify the core intent behind the query. Sometimes that means expanding a terse question into a more descriptive one, or breaking a complex query into smaller, more focused sub-queries.

  2. Leverage LlamaIndex's built-in rewriting tools: LlamaIndex has query transformation utilities that can help automatically rephrase or enrich queries. Use them as a starting point and tweak the results.

  3. Using a model to generate rewrites: Have a language model generate a "clarified" version of the query. Feed the model the initial query and ask it to rephrase or add context.

Step-by-step approach:

  • Initial query expansion: Take the raw user query and expand it with natural language context.
  • Model-assisted rewriting: Use a model to generate alternate phrasings. Prompt with something like, "Please rewrite this query in a more detailed form for better retrieval results."
  • Testing and iteration: Test rewritten versions and see which yield the best matches.
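The model-assisted step can be a single extra LLM call before retrieval. A minimal sketch against Ollama's /api/generate endpoint (the prompt wording and the localhost URL are illustrative assumptions, not necessarily what this project uses):

```python
import json
import urllib.request

REWRITE_TEMPLATE = (
    "Please rewrite this query in a more detailed form for better "
    "retrieval results. Reply with the rewritten query only.\n\n"
    "Query: {query}"
)

def build_rewrite_request(query: str, model: str = "command-r7b") -> urllib.request.Request:
    """Build (but do not send) a rewrite request for Ollama's /api/generate."""
    payload = {
        "model": model,
        "prompt": REWRITE_TEMPLATE.format(query=query),
        "stream": False,
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
```

Sending the request with urllib.request.urlopen() and feeding the "response" field of the JSON reply into the retriever is then the whole rewrite loop.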

January 1, 2026

Updated storage_exp by running build_exp.py.


September 6, 2025

Rebuilt storage_exp: 2048 embeddings. Took about 4 minutes.

Need to experiment more with query rewrites. Save the query but match on extracted terms? You can imagine an agent that decides between a search like grep and a more semantic search. The search is not good at finding dates ("What did the author say on DATE") or when searching for certain terms ("What did the author say about libraries?").


August 28, 2025

Email embedding experiment

Idea: given a strong (or top) hit, use this node to find similar chunks.

Working with demo. Saved 294 emails from president@udel.edu. Embedding these took nearly 45 minutes. The resulting vector store is larger than the journals. The search is ok, but could be optimized by stripping the headers.

To make the text files:

textutil -convert txt *.eml

The resulting text: 145,204 lines, 335,690 words, 9,425,696 characters total (~9.4 MB of text).

$ python build.py
Parsing nodes: 100%|████████| 294/294 [00:31<00:00,  9.28it/s]
Generating embeddings: ... (19 batches of 2048)

Total = 2,571 seconds = 42 minutes 51 seconds.

Vector store size:

$ ls -lh storage/
-rw-r--r--  867M  default__vector_store.json
-rw-r--r--  100M  docstore.json
-rw-r--r--   18B  graph_store.json
-rw-r--r--   72B  image__vector_store.json
-rw-r--r--  3.1M  index_store.json

That's a big vector store! The journals have a vector store that is only 90M (an order of magnitude smaller) from a body of texts that is ~3 MB.

After extracting just the text/html from the eml files: 21,313 lines, 130,901 words, 946,474 characters total — much smaller. Build time dropped to ~1:15. Store size dropped to ~25 MB.
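textutil converts whole messages, headers included. A stdlib sketch of the header-stripping extraction (the actual script used for the smaller build may have differed):

```python
from email import policy
from email.parser import BytesParser

def eml_to_text(raw: bytes) -> str:
    """Extract just the text body from one .eml message.

    Headers and non-text parts are dropped; text/plain is preferred
    over text/html when both are present.
    """
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    body = msg.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else ""
```

Running this over each .eml file before indexing is what cuts the corpus from ~9.4 MB of raw messages to under 1 MB of body text.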


August 27, 2025

The wrapped query works great on the decwriter! Queries take about 83 seconds, and sometimes up to 95 seconds if the model needs to be loaded. Longest query so far (had to load all models) is 98 seconds.


August 26, 2025

  • Started an "experimental" folder for combining semantic + LLM-guided regex search.
  • Created an "archive" folder for older versions.
  • Wrote a shell script wrapper and a version that takes input on the command line.

Timed the retrieval (backup was running, so probably longer):

real    1m20.971s
user    0m13.074s
sys     0m1.429s

August 25, 2025

  • Build a bash wrapper around the Python query engine to handle input and output.
  • Expand the search to extract keywords and do a regex search on those. Can you search the real text chunks and sort by a similarity calc?
  • What if you returned more results and sorted these by a cluster grouping?

August 21, 2025

HyDE experiments

HyDE stands for Hypothetical Document Embeddings.

Took out HyDE to test generation. Not sure HyDE is doing anything. Indeed, it is not generating results that are any better or different than just using the BAAI/bge-large-en-v1.5 embedding model and a custom prompt. The BAAI/bge model gives very good results!

Compared llama3.1:8B with command-r7b. Both are about the same size and give similar results. ChatGPT is pretty adamant that command-r7b will stick more to the retrieved content. This is reinforced by the following exercise:

command-r7b output (RAG faithfulness test):

The last day you can file your 2023 taxes without incurring any penalties is April 15th, 2024. This is the official filing deadline for the 2023 tax year. Filing after this date will result in a late fee, with a 5% penalty per month up to a maximum of 25%.

llama3.1:8b output:

April 15th, 2024.

Note: The context only mentions the filing deadline and late fees, not any possible extensions or exceptions.

ChatGPT says: LLaMA 3 8B might answer correctly but add a guess like "extensions are available." Command R 7B is more likely to stay within the context boundaries. This is what we see.


August 20, 2025

Prompt engineering

Tried a query rewrite, but this is difficult. Reverted. Got a pretty good result with this question:

"What would the author say about art vs. engineering?"

A prompt that starts with "What would the author say..." or "What does the author say..." leads to higher similarity scores.

Implemented the HyDE rewrite of the prompt and that seems to lead to better results, too.

Prompt comparison

First prompt (research assistant, bulleted list):

"""You are a research assistant. You're given journal snippets (CONTEXT) and
a user query. Your job is NOT to write an essay but to list the best-matching
journal files with a 1-2 sentence rationale. ..."""

Second prompt (expert research assistant, theme + 10 files):

"""You are an expert research assistant. You are given top-ranked journal
excerpts (CONTEXT) and a user's QUERY. ... Format your answer in two parts:
1. Summary Theme  2. Matching Files (bullet list of 10)..."""

The second prompt provides better responses.

Chunk size experiments

Experimenting with chunking. Using 512-token chunks with 10-token overlap: 2412 vectors. Tried 512 tokens and 0 overlap. Changed the paragraph separator to "\n\n"; the default is "\n\n\n" for some reason.

Reduced chunks to 256 tokens to see if higher similarity scores result. It decreased them a bit. Tried 384 tokens and 40 overlap. The 256-token/25-overlap setting worked better, so it was restored. Will work on the semantic gap with the query.
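A 256-token window with 25-token overlap means chunk starts advance by 231 tokens. A stdlib sketch of that window arithmetic (LlamaIndex's SentenceSplitter is smarter, additionally respecting sentence and paragraph boundaries such as the paragraph separator mentioned above):

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 256, overlap: int = 25) -> list[list[str]]:
    """Sliding-window chunking: each chunk shares `overlap` tokens with
    the previous one, so the stride between chunk starts is
    chunk_size - overlap."""
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap is what keeps a sentence that straddles a chunk boundary fully inside at least one chunk, at the cost of slightly more vectors.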

Embedding model switch

Switched the embedding model to BAAI/bge-large-en-v1.5. It seems to do better, although it requires more time to embed the vector store. Interestingly, the variance of the embedding values is much lower. The distribution is narrower, although the values skew in a different way. There is a broader distribution of clusters in the vectors.


August 17, 2025

Working on the Jupyter notebook to measure stats of the vector store.

Links:


August 14, 2025

Ideas for the document search pipeline:

  • Search by cosine similarity for semantic properties
  • Generate search terms and search by regex — names, specific topics or words

Problem: HuggingFace requires internet connection. Solution: download locally.

HuggingFace caches models at ~/.cache/huggingface/hub/. It will redownload them if forced to or if there is a model update.

Solution: ran the script once online, which downloaded the model into the local directory. Then set local_files_only=True to run offline:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    cache_folder="./models",
    model_name="all-mpnet-base-v2",
    local_files_only=True,
)

LlamaIndex concepts

  • Nodes: chunks of text (paragraphs, sentences) extracted from documents. Stored in the document store (e.g., SimpleDocumentStore), which keeps track of the original text and metadata.
  • Vector store: stores embeddings of nodes. Each entry corresponds to a node's embedding vector. Query results include node IDs (or metadata) that link back to the original nodes in the document store.
  • Vector store entries are linked to their full content via metadata (e.g., node ID).

August 12, 2025

Want to understand the vector store better:

  • Is it effective? Are queries effective?
  • How many entries are there?
  • Why doesn't it find Katie Hafner, but it does find Jimmy Soni?
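The entry count, at least, can be read straight off the persisted store. A sketch that assumes the default local store's JSON layout with a top-level "embedding_dict" key (verify against your llama-index version):

```python
import json

def count_vectors(path: str) -> int:
    """Count embeddings in a persisted default__vector_store.json.

    Assumes the simple local vector store layout, where each entry in
    "embedding_dict" maps a node ID to its embedding vector.
    """
    with open(path, encoding="utf-8") as f:
        store = json.load(f)
    return len(store["embedding_dict"])
```

For example, count_vectors("storage/default__vector_store.json") should match the vector counts noted elsewhere in this log.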

Query results are improved with a better prompt. Increased top-k to 50 to give the model more text to draw from. But it hallucinates at the end of longer responses.

The SimilarityPostprocessor with similarity_cutoff=0.78 returned nothing. The similarity scores must be very low.

Performance is difficult to tune. Sometimes the models work and sometimes they don't. Loading multiple models simultaneously causes issues; check what is resident with ollama ps and unload with ollama stop MODEL_NAME.


August 10, 2025

Project start

Files made today: build.py, query_topk.py, query.py.

Build a semantic search of journal texts:

  • Ingest all texts and metadata
  • Search and return relevant text and file information

Created .venv environment:

python3 -m venv .venv
source .venv/bin/activate
pip install llama-index-core llama-index-readers-file \
    llama-index-llms-ollama llama-index-embeddings-huggingface

Ran build.py successfully and generated store. SimpleDirectoryReader stores the filename and file path as metadata.

Model comparison (initial): llama3.1:8B, deepseek-r1:8B, gemma3:1b. Can't get past a fairly trivial query engine right now. These aren't very powerful models. Need to keep testing and see what happens.