NOTES ON ssearch project / experiments

Active files (after Feb 2026 tidy-up):
  build_exp_claude.py            - builds vector store with chunking and validation
  query_topk_prompt_engine_v2.py - main query engine (v2 prompt, CLI)
  query_topk_prompt_engine_v3.py - v3: adds cross-encoder re-ranking
  retrieve_raw.py                - verbatim chunk retrieval (no LLM)
  query_hybrid_bm25_v4.py        - hybrid BM25 + vector query with LLM synthesis
  retrieve_hybrid_raw.py         - hybrid verbatim chunk retrieval (no LLM)
  search_keywords.py             - keyword search via POS-based term extraction
  run_query.sh                   - shell wrapper for interactive querying

Earlier scripts moved to archived/: build.py, build_exp.py, query_topk.py,
query_catalog.py, query_exp.py, query_topk_prompt.py, query_topk_prompt_engine.py,
query_topk_prompt_dw.py, query_rewrite_hyde.py, query_multitool.py,
shared/build.py, shared/query.py, vs_metrics.py, claude_diagnostic.py,
query_claude_sonnet.py, query_tree.py

Best performance so far: BAAI/bge-large-en-v1.5 embedding model with 256-token
chunks and 25-token overlap, custom prompt (v2), no prompt rewrite. The
embedding model is slower than all-mpnet-base-v2, but not too bad when
generating answers. command-r7b as the generating model runs about as quickly
as llama3.1:8B, but produces results that stick better to the provided context.

TO DO:
1. [DONE] Test v3 (cross-encoder re-ranking) and compare results with v2.
   Selected ms-marco-MiniLM-L-12-v2 after testing three models.
2. [DONE] Verbatim retrieval mode (retrieve_raw.py). Uses index.as_retriever()
   instead of index.as_query_engine() to get chunks without LLM synthesis.
   Re-ranks with the same cross-encoder, then outputs raw chunk text with
   metadata and scores.
3. [DONE] Keyword search pipeline (search_keywords.py). Extracts nouns and
   adjectives via NLTK POS tagging, then greps data files. Complements vector
   search for exact names, places, dates.
4. [DONE] BM25 hybrid retrieval (sparse + dense).
   Two new scripts: query_hybrid_bm25_v4.py (with LLM synthesis) and
   retrieve_hybrid_raw.py (verbatim chunks, no LLM). Both run BM25 (top-20)
   and vector (top-20) retrievers, merge/deduplicate, then cross-encoder
   re-rank to top-15. Uses llama-index-retrievers-bm25.
5. Explore query expansion (multiple phrasings, merged retrieval)
6. Explore different vector store strategies (database)
7. [DONE] Test ChatGPT API for final LLM generation (instead of local Ollama)
8. [DONE] Remove API key from this file. Moved to ~/.bashrc as OPENAI_API_KEY.

The retrieval pipeline (embedding, vector search, cross-encoder re-ranking)
stays the same. Only the final synthesis LLM changes. Steps:
a. Install the LlamaIndex OpenAI integration:
     pip install llama-index-llms-openai
b. Set API key as environment variable:
     export OPENAI_API_KEY="sk-..."
   (Or store in a .env file and load with python-dotenv. Do NOT commit the key
   to version control.)
c. In the query script (v3 or v4), replace the Ollama LLM with OpenAI:

     # Current (local):
     from llama_index.llms.ollama import Ollama
     Settings.llm = Ollama(
         model="command-r7b",
         request_timeout=360.0,
         context_window=8000,
     )

     # New (API):
     from llama_index.llms.openai import OpenAI
     Settings.llm = OpenAI(
         model="gpt-4o-mini",  # or "gpt-4o" for higher quality
         temperature=0.1,
     )

d. Run the query script as usual. Everything else (embedding model, vector
   store, cross-encoder re-ranker, prompt) is unchanged.
e. Compare output quality and response time against command-r7b. Models to
   try: gpt-4o-mini (cheap, fast), gpt-4o (better quality). The prompt (v3 or
   v4) should work without modification since it's model-agnostic -- just
   context + instructions.

Note: This adds an external API dependency and per-query cost. The embedding
and re-ranking remain fully local/offline.

API KEY: moved to ~/.bashrc as OPENAI_API_KEY (do not store in repo)

Getting an OpenAI API key:
a. Go to https://platform.openai.com/ and sign up (or log in).
b.
   Navigate to API keys: Settings > API keys (or
   https://platform.openai.com/api-keys).
c. Click "Create new secret key", give it a name, and copy it. The key starts
   with "sk-" and is shown only once.
d. Add billing: Settings > Billing. Load a small amount ($5-10) to start. API
   calls are pay-per-use, not a subscription.
e. Set the key in your shell before running a query:
     export OPENAI_API_KEY="sk-..."
   Or add to ~/.zshrc (or ~/.bashrc) to persist across sessions. Do NOT commit
   the key to version control or put it in scripts.

Approximate cost per query (Feb 2026):
  gpt-4o-mini: ~$0.001-0.003 (15 chunks of context)
  gpt-4o:      ~$0.01-0.03

----------------------------------------------------------------------------------------
FEBRUARY 20, 2026

OFFLINE USE: ENVIRONMENT VARIABLES MUST BE SET BEFORE IMPORTS

Despite setting HF_HUB_OFFLINE=1 and SENTENCE_TRANSFORMERS_HOME=./models
(added Feb 16), the scripts still failed offline with a ConnectionError trying
to reach huggingface.co. The error came from AutoTokenizer.from_pretrained()
calling list_repo_templates(), which makes an HTTP request to the HuggingFace
API.

Root cause: the huggingface_hub library evaluates HF_HUB_OFFLINE at import
time, not at call time. The constant is set once in huggingface_hub/constants.py:

  HF_HUB_OFFLINE = _is_true(os.environ.get("HF_HUB_OFFLINE")
                            or os.environ.get("TRANSFORMERS_OFFLINE"))

In all four scripts, the os.environ lines came AFTER the imports:

  from llama_index.embeddings.huggingface import HuggingFaceEmbedding  # <-- triggers import of huggingface_hub
  from llama_index.core.postprocessor import SentenceTransformerRerank
  import os
  os.environ["HF_HUB_OFFLINE"] = "1"  # <-- too late, constant already False

By the time os.environ was set, huggingface_hub had already imported and locked
the constant to False. The env var existed in the process environment but the
library never re-read it.
Fix: moved `import os` and all three os.environ calls to the top of each file,
before any llama_index or huggingface imports:

  import os
  os.environ["TOKENIZERS_PARALLELISM"] = "false"
  os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
  os.environ["HF_HUB_OFFLINE"] = "1"

  from llama_index.core import ...  # now these see the env vars
  from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Updated scripts: query_topk_prompt_engine_v3.py, retrieve_raw.py,
query_hybrid_bm25_v4.py, retrieve_hybrid_raw.py.

GENERAL LESSON FOR OFFLINE HUGGINGFACE USE:

The HuggingFace ecosystem has multiple libraries that check for offline mode:
- huggingface_hub: reads HF_HUB_OFFLINE (or TRANSFORMERS_OFFLINE) at import
- transformers: delegates to huggingface_hub's constant
- sentence-transformers: delegates to huggingface_hub's constant

All of them evaluate the flag ONCE at module load time. This means:
1. os.environ must be set before ANY import that touches huggingface_hub
2. Setting the env var in a "Globals" section after imports does NOT work
3. Even indirect imports count -- llama_index.embeddings.huggingface
   transitively imports huggingface_hub, so the flag must precede it
4. Alternatively, set the env var in the shell before running Python:
     export HF_HUB_OFFLINE=1
   This always works because it's set before any Python code runs.
5. The newer transformers library (v4.50+) added list_repo_templates() in
   AutoTokenizer.from_pretrained(), which makes network calls that weren't
   present in earlier versions. This is why the Feb 16 fix worked initially
   (or appeared to) but broke after a package update.

This is a common pitfall for anyone running HuggingFace models offline (e.g.,
on a laptop without network, air-gapped environments, or behind restrictive
firewalls). The models are cached locally and work fine -- but the library
still tries to check for updates unless the offline flag is set correctly.
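The import-time capture can be reproduced without HuggingFace at all. This is a stdlib-only sketch (the "fake_hub" module is a stand-in, not real HuggingFace code) that builds a module whose body reads the env var exactly once, at exec/import time:

```python
import os
import types

# Simulate a library that evaluates an env var ONCE at module import time,
# the way huggingface_hub/constants.py sets HF_HUB_OFFLINE.
LIB_SRC = "import os\nOFFLINE = os.environ.get('HF_HUB_OFFLINE') == '1'\n"

def import_fake_hub():
    mod = types.ModuleType("fake_hub")
    exec(LIB_SRC, mod.__dict__)  # runs the module body, capturing the env var now
    return mod

os.environ.pop("HF_HUB_OFFLINE", None)
hub = import_fake_hub()              # "imported" while the env var is unset
os.environ["HF_HUB_OFFLINE"] = "1"   # set AFTER import: too late
print(hub.OFFLINE)                   # False -- the constant is never re-read

hub2 = import_fake_hub()             # "imported" after the env var is set
print(hub2.OFFLINE)                  # True
```

Same mechanics as the real bug: the second "import" sees the flag only because the environment was mutated before the module body ran.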
---

INCREMENTAL VECTOR STORE UPDATES

Added incremental update mode to build_exp_claude.py. Previously the script
rebuilt the entire vector store from scratch every run (~1848 files). Now it
defaults to incremental mode: loads the existing index, compares against
./data, and only processes new, modified, or deleted files.

Usage:
  python build_exp_claude.py            # incremental update (default)
  python build_exp_claude.py --rebuild  # full rebuild from scratch

How it works:
- The LlamaIndex docstore (storage_exp/docstore.json) already tracks every
  indexed document with metadata: file_name, file_size, last_modified_date.
- The script scans ./data/*.txt and classifies each file:
    New:       file_name not in docstore → insert
    Modified:  file_size or last_modified_date differs → delete + re-insert
    Deleted:   in docstore but not on disk → delete
    Unchanged: skip
- Uses index.insert() and index.delete_ref_doc() from the LlamaIndex API.
- The same SentenceSplitter (256 tokens, 25 overlap) is applied via
  Settings.transformations so chunks match the original build.

Timing: an incremental update with nothing to do takes ~17s (loading the
index). A full rebuild takes several minutes. The first incremental run after
a stale index found 8 new files and 204 modified files, completed in ~65s.

Important detail: SimpleDirectoryReader converts file timestamps to UTC
(datetime.fromtimestamp(mtime, tz=timezone.utc)) before formatting as
YYYY-MM-DD. The comparison logic must use UTC too, or files modified late in
the day will show as "modified" due to the date rolling forward in UTC. This
caused a false-positive bug on the first attempt.

This enables running the build as a cron job to keep the vector store current
as new journal entries are added.

----------------------------------------------------------------------------------------
FEBRUARY 18, 2026

LLM COMPARISON: gpt-4o-mini (OpenAI API) vs command-r7b (local Ollama)

Test query: "Passages that quote Louis Menand."
(hybrid BM25+vector, v4)

Retrieval was identical (same 15 chunks, same scores) -- only synthesis
differs. Results saved in tests/results_openai.txt and
tests/results_commandr7b.txt.

gpt-4o-mini:
- Cited 6 files (2025-11-04, 2025-02-14, 2022-08-14, 2025-07-27, 2025-02-05,
  2024-09-04). Drew from chunks ranked as low as #14.
- Better at distinguishing direct quotes from paraphrases and indirect
  references. Provided a structured summary with numbered entries.
- 44 seconds total (most of that is local retrieval/re-ranking; the API call
  itself is nearly instant).

command-r7b:
- Cited 2 files (2025-11-04, 2022-08-14). Focused on the top-scored chunks and
  ignored lower-ranked ones.
- Pulled out actual quotes verbatim as block quotes -- more useful if you want
  the exact text rather than a summary.
- 78 seconds total.

Summary: gpt-4o-mini is broader (more sources, better use of the full context
window) and nearly 2x faster. command-r7b is more focused and reproduces exact
quotes. Both correctly identified the core passages. The quality difference is
noticeable but not dramatic -- the retrieval pipeline does most of the heavy
lifting.

TEMPERATURE NOTE: The gpt-4o-mini test used temperature=0.1 (nearly
deterministic). command-r7b via Ollama defaults to temperature=0.8 -- so the
two models were tested at very different temperatures, which may account for
some of the stylistic difference.

Temperature guidance for RAG synthesis:
  0.0-0.1  Nearly deterministic. Picks highest-probability tokens. Good for
           factual extraction and consistency. Can "tunnel vision" on the
           most obvious interpretation.
  0.3-0.5  Moderate. More varied phrasing, more willing to draw connections
           across chunks. Good middle ground for this pipeline since the
           prompt already constrains the model to use only the provided
           context.
  0.7-1.0  Creative/varied. Riskier for RAG -- may paraphrase loosely or make
           weaker inferential leaps. Not ideal when you want faithfulness to
           source text.
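The mechanics behind this guidance can be shown with a toy softmax over token logits (illustrative numbers, not from any model): temperature divides the logits before normalization, so low T concentrates nearly all probability on the top token while high T spreads it out.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Token sampling distribution at a given temperature.
    Lower T sharpens the distribution toward the highest-logit token;
    higher T flattens it across alternatives."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # made-up logits for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 1.0)
```

At T=0.1 the top token takes essentially all the mass (>0.99); at T=1.0 it gets roughly 0.6, leaving real probability on the alternatives -- which is why a 0.8-default model can "wander" away from the retrieved context.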
FOLLOW-UP: temperature=0.3 for both models (same query, same retrieval)

command-r7b at 0.3 (was 0.8): Major improvement. Cited 6 files (was 2). Drew
from lower-ranked chunks including #15. Used the full context window instead
of fixating on top hits. Took 94s (was 78s) due to more output.

gpt-4o-mini at 0.3 (was 0.1): Nearly identical to the 0.1 run. Same 6 files,
same structure. Slightly more interpretive phrasing but no meaningful change.
This model is less sensitive to temperature for RAG synthesis.

KEY FINDING FOR ARTICLE: Temperature is a critical but often overlooked
parameter when evaluating the generation stage of a RAG pipeline. In our
tests, a local 7B model (command-r7b) went from citing 2 sources to 6 -- a 3x
improvement in context utilization -- simply by lowering temperature from 0.8
to 0.3. At the higher temperature, the model "wandered" during generation,
focusing on the most salient chunks and producing repetitive output. At the
lower temperature, it methodically worked through the full context window.

This has implications for RAG evaluation methodology:
1. When comparing LLMs for RAG synthesis, temperature must be controlled
   across models. Our initial comparison (gpt-4o-mini at 0.1 vs command-r7b
   at its 0.8 default) overstated the quality gap between models.
2. The "right" temperature for RAG is lower than for open-ended generation.
   The prompt and retrieved context already constrain the task; high
   temperature adds noise rather than creativity.
3. Temperature affects context utilization, not just style. A model that
   appears to "ignore" lower-ranked chunks may simply need a lower
   temperature to attend to them.
4. At temperature=0.3, a local 7B model and a cloud API model converged on
   similar quality (6 files cited, good coverage, mix of quotes and
   paraphrase). The retrieval pipeline does most of the heavy lifting; the
   generation model's job is to faithfully synthesize what was retrieved.
Testing method: Hold retrieval constant (same query, same vector store, same
re-ranker, same top-15 chunks). Vary only the LLM and temperature. Compare on:
number of source files cited, whether lower-ranked chunks are used,
faithfulness to source text, and total query time. Results saved in tests/
with naming convention results__t.txt.

---

Upgraded LlamaIndex from 0.13.1 to 0.14.14 to add OpenAI API support.
Installing llama-index-llms-openai pulled in llama-index-core 0.14.14, which
was incompatible with the existing companion packages (all pinned to <0.14).
Fixed by upgrading all companion packages together:

  pip install --upgrade llama-index-embeddings-huggingface \
      llama-index-readers-file llama-index-llms-ollama \
      llama-index-retrievers-bm25

Final package versions:
  llama-index-core                    0.14.14  (was 0.13.1)
  llama-index-embeddings-huggingface  0.6.1    (was 0.6.0)
  llama-index-llms-ollama             0.9.1    (was 0.7.0)
  llama-index-llms-openai             0.6.18   (new)
  llama-index-readers-file            0.5.6    (was 0.5.0)
  llama-index-retrievers-bm25         0.6.5    (unchanged)
  llama-index-workflows               2.14.2   (was 1.3.0)

Smoke test: retrieve_raw.py "mining towns" -- works, same results as before.
No vector store rebuild needed. The existing storage_exp/ loaded fine with
0.14.

---

Checked whether paragraph_separator="\n\n" in build_exp_claude.py makes sense
for the journal data. Results from scanning all 1,846 files in ./data/:
- 1,796 files (97%) use \n\n as paragraph boundaries
- 28 files use single newlines only
- 22 files have no newlines at all
- Average paragraphs per file: 10.8 (median 7, range 0-206)
- 900 files (49%) also use --- as a topic/section separator

The \n\n setting is correct. SentenceSplitter tries to break at
paragraph_separator boundaries first, then falls back to sentence boundaries,
then words. With 256-token chunks, this keeps semantically related sentences
together within a paragraph. The --- separators are already surrounded by \n\n
(e.g., \n\n---\n\n), so they naturally act as break points too.
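A scan along these lines can be done with a small stdlib helper (this is a hypothetical reconstruction of the check, not the actual scan code): classify each file's newline style and count its \n\n-separated paragraphs, then aggregate over ./data/*.txt.

```python
def paragraph_stats(text: str):
    """Classify a file's paragraph style and count \n\n-separated paragraphs.

    Returns (style, paragraph_count) where style is one of
    'no-newlines', 'single-newline', or 'double-newline'."""
    if "\n" not in text:
        style = "no-newlines"
    elif "\n\n" in text:
        style = "double-newline"
    else:
        style = "single-newline"
    # Count non-empty paragraphs under the \n\n convention
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return style, len(paragraphs)
```

Feeding each file's contents through this and tallying the styles / taking the median paragraph count would reproduce the kind of numbers reported above.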
No special handling needed.

Note: "\n\n" is actually the default value for paragraph_separator in
LlamaIndex's SentenceSplitter. The explicit setting documents intent but is
functionally redundant.

List-style entries with single newlines between items (e.g., 2001-09-14.txt)
stay together within a chunk, which is desirable -- lists shouldn't be split
line by line.

FEBRUARY 16, 2026

Cached the cross-encoder model (cross-encoder/ms-marco-MiniLM-L-12-v2) in
./models/ for offline use. Previously, HuggingFaceEmbedding already used
cache_folder="./models" with local_files_only=True for the embedding model,
but the cross-encoder (loaded via SentenceTransformerRerank → CrossEncoder)
had no cache_folder parameter and would fail offline when it tried to phone
home for updates.

Fix: all scripts that use the cross-encoder now set two environment variables
before model initialization:

  os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
  os.environ["HF_HUB_OFFLINE"] = "1"

SENTENCE_TRANSFORMERS_HOME directs the CrossEncoder to look in ./models/ for
cached weights. HF_HUB_OFFLINE prevents any network access attempt.

The model was cached using huggingface_hub.snapshot_download():

  from huggingface_hub import snapshot_download
  snapshot_download('cross-encoder/ms-marco-MiniLM-L-12-v2', cache_dir='./models')

Models now in ./models/:
  models--BAAI--bge-large-en-v1.5                   (embedding, bi-encoder)
  models--cross-encoder--ms-marco-MiniLM-L-12-v2    (re-ranker, cross-encoder)
  models--sentence-transformers--all-mpnet-base-v2  (old embedding, kept)

Updated scripts: query_topk_prompt_engine_v3.py, retrieve_raw.py,
query_hybrid_bm25_v4.py, retrieve_hybrid_raw.py.

FEBRUARY 15, 2026

DESIGN NOTE on search_keywords.py (backburnered):

The POS tagger has a fundamental limitation: it was trained on declarative
prose, not imperative queries.
A query like "Find passages that mention Louis Menand" causes the tagger to
classify "find" and "mention" as nouns (NN) rather than verbs, because the
imperative sentence structure is unusual in its training data. This floods
results with false positives (304 matches across 218 files instead of the
handful mentioning Menand).

More fundamentally: for term-based searches, the POS tagging layer adds
minimal value over bare grep. If the input is "Louis Menand", POS tagging
extracts "louis menand" -- identical to what grep would match. The tool's
real value is not the NLP layer but the convenience wrapper: searching all
files at once, joining multi-word proper nouns, sorting by match count, and
showing context around matches. It's essentially a formatted multi-file grep.

Possible future direction: merge keyword search results with semantic search
results. The keyword pipeline catches exact names, places, and dates that
embeddings miss, while the semantic pipeline catches thematic relevance that
keywords miss. A hybrid approach could combine both result sets, using
keyword matches to boost or supplement vector retrieval. This connects to the
BM25 hybrid retrieval idea (TODO item 4).

NEW SCRIPTS: query_hybrid_bm25_v4.py and retrieve_hybrid_raw.py

Implemented BM25 hybrid retrieval (TODO item 4). Both scripts run two
retrievers in parallel on the same query:
- Vector retriever: top-20 by cosine similarity (semantic meaning)
- BM25 retriever: top-20 by term frequency (exact lexical matching)

Results are merged and deduplicated by node ID, then passed to the
cross-encoder re-ranker (ms-marco-MiniLM-L-12-v2) -> top-15.
query_hybrid_bm25_v4.py feeds the re-ranked chunks to the LLM (same v3 prompt
and command-r7b model). retrieve_hybrid_raw.py outputs the raw chunks with
source annotations: [vector-only], [bm25-only], or [vector+bm25], showing
which retriever nominated each result.
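The merge/dedup/annotate step can be sketched in plain Python (a simplified stand-in for the scripts' logic; the real code works on LlamaIndex NodeWithScore objects rather than tuples):

```python
def merge_candidates(vector_hits, bm25_hits):
    """Merge two retriever result lists, dedup by node ID, and tag each
    candidate with which retriever(s) nominated it.

    vector_hits / bm25_hits: lists of (node_id, text) pairs."""
    merged = {}
    for node_id, text in vector_hits:
        merged[node_id] = {"text": text, "source": "vector-only"}
    for node_id, text in bm25_hits:
        if node_id in merged:
            merged[node_id]["source"] = "vector+bm25"  # nominated by both
        else:
            merged[node_id] = {"text": text, "source": "bm25-only"}
    return merged
```

The merged pool (up to 40 candidates, fewer after dedup) then goes to the cross-encoder, which ranks on relevance alone -- the source tag is kept only for display.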
The BM25 retriever uses BM25Retriever.from_defaults(index=index) from
llama-index-retrievers-bm25 (v0.6.5). It indexes the nodes already stored in
the persisted vector store -- no separate build step needed.

Key idea: BM25's job is only to nominate candidates that vector similarity
might miss (exact names, dates, specific terms). The cross-encoder decides
final relevance regardless of where candidates came from.

FEBRUARY 12, 2026

Updated vector store, now 4816 chunks?

Scope of a language-model-based search: LLMs can summarize, but lack the
ability to critically read and compare information. ChatGPT can summarize the
literature that I've cited, but it cannot critique it. (It could generate
from published critiques.) Our ability to critically read and synthesize from
literature is an important skill. (Most reviews fall far short, simply
aggregating "advances" without asking why, how, or whether they are real or
not.)

FEBRUARY 11, 2026

Tidied up the project with Claude Code:
- Generated README.md and claude.md documentation
- Archived superseded scripts (v1 query engines, old build scripts, shared/,
  experimental/query_multitool.py)
- Removed stale storage_exp copy/ (Aug 2025 backup, ~105 MB)
- Removed empty shared/ and experimental/ directories

Created query_topk_prompt_engine_v3.py: adds cross-encoder re-ranking.

The idea: the current pipeline (v2) uses a bi-encoder (BAAI/bge-large-en-v1.5)
that encodes query and chunks independently, then compares via cosine
similarity. This is fast but approximate -- the query and chunk never "see"
each other. A cross-encoder takes the query and chunk as a single concatenated
input, with full attention between all tokens. It scores the pair jointly,
which captures nuance that dot-product similarity misses (paraphrase,
negation, indirect relevance). The tradeoff is speed: you can't pre-compute
scores.

v3 uses a two-stage approach:
1. Retrieve top-30 via bi-encoder (fast, approximate)
2.
   Re-rank to top-15 with cross-encoder (slow, precise)
3. Pass re-ranked chunks to LLM for synthesis

Cross-encoder model: cross-encoder/ms-marco-MiniLM-L-6-v2 (~80 MB, 6 layers).
Trained on MS MARCO passage ranking. Should add only a few seconds to query
time for 30 candidates.

Next steps to explore:
- BM25 hybrid: combine sparse keyword retrieval (BM25) with dense vector
  search. This would help with exact term matching (names, places, dates)
  that embeddings can miss. More principled than the regex approach in
  query_multitool.py.
- Query expansion: generate 2-3 alternate phrasings of the query, retrieve
  for each, merge results. Different from HyDE (which generated fake
  answers). Bridges vocabulary gaps without hallucination risk.

NOTES FROM CLAUDE:

The current pipeline (v2) uses a bi-encoder for retrieval. The re-ranker adds
a cross-encoder as a second pass. Here's the difference:

Bi-encoder (what you have now)
Your embedding model (BAAI/bge-large-en-v1.5) encodes the query and each
chunk independently into vectors. Similarity is a dot product between two
vectors that were computed separately. This is fast -- you can pre-compute
all chunk vectors once at build time and just compare against the query
vector at search time. But because query and chunk never "see" each other
during encoding, the model can miss subtle relevance signals. It's matching
shapes in vector space, not reading the query and passage together.

Cross-encoder (what v3 adds)
A cross-encoder takes the query and a chunk as a single input pair:
[query, chunk] concatenated together. It reads both simultaneously through
the transformer, with full attention between every token in the query and
every token in the chunk. It outputs a single relevance score. This is much
more accurate because the model can reason about the specific relationship
between your question and the passage -- word overlap, paraphrase, negation,
context. The tradeoff: it's slow.
You can't pre-compute anything because the score depends on the specific
query. Scoring 4,692 chunks this way would take too long.

Why the two-stage approach works:

  4,692 chunks -> bi-encoder (fast, approximate)  -> top 30
  top 30       -> cross-encoder (slow, precise)   -> top 15
  top 15       -> LLM synthesis                   -> response

The bi-encoder acts as a fast filter to get a manageable candidate set. The
cross-encoder then does careful, pairwise scoring on just those 30 candidates.
You get the speed of vector search with the accuracy of joint scoring.

Concrete example: If you search "times the author felt conflicted about
career choices," the bi-encoder might rank a chunk about "job satisfaction"
highly because the vectors are close. But a chunk that says "I couldn't
decide whether to stay or leave" -- without using the word "career" -- might
score lower in vector space. The cross-encoder, reading both query and chunk
together, would recognize that "couldn't decide whether to stay or leave" is
highly relevant to "felt conflicted about career choices."

The model we're using (cross-encoder/ms-marco-MiniLM-L-6-v2) is small
(~80 MB, 6 transformer layers) and was trained on the MS MARCO passage
ranking dataset. Scoring 30 chunks should add only a few seconds to query
time.

***

When I tried running, it failed because I was offline. (Requires the
cross-encoder model.)

Questions:
- Why ms-marco-MiniLM-L-6-v2 as the cross-encoder?
- What does a chunk look like? How much text?
- Am I pushing the limit of the context length for BAAI/bge-large-en-v1.5?

***

Updated the v3 prompt to account for re-ranked context. The v2 prompt was
carried over unchanged but didn't reflect that every chunk has now been
vetted by the cross-encoder.
Changes:
- Tells the LLM the context is from a "personal journal collection" and has
  been "selected and ranked for relevance"
- "Examine ALL provided excerpts, not just the top few" -- counters the
  single-file collapse seen in initial testing
- "When multiple files touch on the query, note what each one contributes" --
  encourages breadth across sources
- "End with a list of all files that contributed" -- stronger than v2's vague
  "list all relevant source files"

Also updated run_query.sh to point to v3.

TEST RESULTS (v3 with new prompt):

Query: "Passages that describe mining towns."
- Response cited 2 passages from 2023-03-15.txt (coal mining,
  great-grandfather)
- Source documents included 7 distinct files across 15 chunks
- Top cross-encoder score: -1.177 (2025-09-14.txt)
- LLM focused on 2023-03-15.txt, which had the most explicit mining content
- Query time: 76 seconds
- Note: cross-encoder scores are raw logits (can be negative), not 0-1 cosine
  similarity

Query: "I am looking for entries that discuss memes and cognition."
- Response cited 6 distinct files with specific content from each: 2025-07-14
  (Dennett/Blackmore on memes), 2023-09-20 (Hurley model), 2024-03-24
  (multiple drafts model), 2021-04-25 (consciousness discussion), 2026-01-08
  (epistemological frameworks), 2025-03-10 (Extended Mind Theory)
- Top cross-encoder score: 4.499 (2026-01-08.txt) -- clear separation from
  the rest
- LLM drew from chunks ranked 3rd, 4th, 5th, 12th, and 15th -- confirming it
  examines the full context, not just top hits
- Query time: 71 seconds

Observations:
- The v3 prompt produces much better multi-source synthesis than v2's prompt
  did
- Cross-encoder scores show clear separation between strong and weak matches
- The re-ranker + new prompt together encourage breadth across files
- Query time comparable to v2 (~70-80 seconds)

CROSS-ENCODER MODEL COMPARISON:

Tested three cross-encoder models on the same query ("Discussions of
Kondiaronk and the Wendats") to compare re-ranking behavior.

1.
   cross-encoder/ms-marco-MiniLM-L-12-v2 (baseline)
- Scores: raw logits, wide spread (top score 3.702)
- Clear separation between strong and weak matches
- Balanced ranking: 2025-06-07.txt #1, 2025-07-28.txt #2, 2024-12-25.txt #3
- Query time: ~70-80 seconds
- Trained on MS MARCO passage ranking (query → relevant passage)

2. cross-encoder/stsb-roberta-base
- Scores: 0.308 to 0.507 -- very compressed range (0.199 spread)
- Poor differentiation: the model can't clearly separate relevant from
  irrelevant
- Pulled in 2019-07-03.txt at #2 (not in L-12 results), dropped 2024-12-25.txt
- Query time: 92 seconds
- Trained on STS Benchmark (semantic similarity, not passage ranking) -- the
  wrong task for re-ranking. It measures "are these texts about the same
  thing?" rather than "is this passage a good answer to this query?"

3. BAAI/bge-reranker-v2-m3
- Scores: calibrated probabilities (0-1). Sharp top (0.812), then 0.313,
  0.262... Bottom 6 chunks at 0.001 (model says: not relevant at all)
- Very confident about #1 (2025-07-28.txt at 0.812), but a long zero tail
- Generated text quality felt slightly better (may be LLM variance)
- 5 of 15 chunks from 2025-07-28.txt -- heavy concentration on one file
- Query time: 125 seconds (50% slower than L-12)
- Multilingual model, larger than the ms-marco MiniLM variants

Summary:
                            Score spread     Speed     Differentiation
  ms-marco-MiniLM-L-12-v2   Wide (logits)    ~70-80s   Good, balanced
  BAAI/bge-reranker-v2-m3   Sharp top/zeros  ~125s     Confident #1, weak tail
  stsb-roberta-base         Compressed       ~92s      Poor

Decision: ms-marco-MiniLM-L-12-v2 is the best fit. Purpose-built for passage
ranking, fastest of the three, and produces balanced rankings with good score
separation. The BAAI model's zero-tail problem means 6 of 15 chunks are dead
weight in the context window (could be mitigated by lowering RERANK_TOP_N or
adding a score cutoff, but that adds complexity for marginal gain). The stsb
model is simply wrong for this task -- semantic similarity ≠ passage
relevance.
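For reference, the score-cutoff mitigation mentioned above could look like this (a hypothetical sketch, not code from the scripts; the threshold value is a made-up example). For logit-scoring models like the ms-marco cross-encoders, a sigmoid maps the raw logits into 0-1 first so one threshold works for both score styles:

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw cross-encoder logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_cutoff(scored_nodes, min_score=0.05, top_n=15):
    """scored_nodes: list of (score, node) pairs with scores in [0, 1].
    Drop the near-zero tail instead of always keeping top_n chunks, so
    dead-weight chunks don't occupy the LLM's context window."""
    kept = [(s, n) for s, n in scored_nodes if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_n]
```

With the bge-reranker scores above, a cutoff of 0.05 would drop the six 0.001-scored chunks and pass only the genuinely ranked ones to synthesis.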
NEW SCRIPTS: retrieve_raw.py and search_keywords.py

retrieve_raw.py -- Verbatim chunk retrieval, no LLM. Uses the LlamaIndex
retriever API instead of the query engine:

  # v3 uses as_query_engine() -- full pipeline including LLM synthesis
  query_engine = index.as_query_engine(
      similarity_top_k=30,
      text_qa_template=PROMPT,
      node_postprocessors=[reranker],
  )
  response = query_engine.query(q)  # returns LLM-generated text

  # retrieve_raw.py uses as_retriever() -- stops after retrieval
  retriever = index.as_retriever(similarity_top_k=30)
  nodes = retriever.retrieve(q)  # returns raw NodeWithScore objects
  reranked = reranker.postprocess_nodes(nodes, query_str=q)

The key distinction: as_query_engine() wraps retrieval + synthesis into one
call (retriever → node postprocessors → response synthesizer → LLM).
as_retriever() returns just the retriever component, giving back the raw
nodes with their text and metadata. The re-ranker's postprocess_nodes()
method can still be called manually on the retrieved nodes. Each node has:
- node.get_content() -- the chunk text
- node.metadata -- dict with file_name, file_path, etc.
- node.score -- similarity or re-ranker score

This separation is useful for inspecting what the pipeline retrieves before
the LLM processes it, and for building alternative output formats.

search_keywords.py -- Keyword search via NLTK POS tagging. Completely
separate from the vector store pipeline. Extracts nouns (NN, NNS, NNP, NNPS)
and adjectives (JJ, JJR, JJS) from the query using nltk.pos_tag(), then
searches ./data/*.txt with regex. Catches exact terms that embeddings miss.
NLTK data (punkt_tab, averaged_perceptron_tagger_eng) is auto-downloaded on
first run.

----------------------------------------------------------------------------------------
JANUARY 12, 2026

Best Practices for Query Rewriting

1. Understand the Original Intent: First off, it helps to clarify the core
intent behind your user's query.
Sometimes that means expanding a terse question into a more descriptive one,
or breaking a complex query into smaller, more focused sub-queries. This
makes it easier for the vector store to match what you really need.

2. Leverage LlamaIndex's Built-In Rewriting Tools: LlamaIndex has some query
transformation utilities that can automatically rephrase or enrich queries.
If they prove limited, use them as a starting point and tweak the results.

3. Use a Model to Generate Rewrites: Another solid approach is to have a
language model like GPT (or whichever model you're comfortable with) generate
a "clarified" version of the query. Essentially, you feed the model the
initial query and ask it to rephrase or add context. This can help the vector
store understand the query better and return more relevant results.

Step-by-Step Approach:
• Initial Query Expansion: Take the raw user query and expand it yourself
  first, just to add some natural-language context.
• Model-Assisted Rewriting: If that's not enough, use a model to generate a
  few alternate phrasings. You can prompt the model with something like,
  "Please rewrite this query in a more detailed form for better retrieval
  results."
• Testing and Iteration: Once you have a few rewritten versions, test them
  and see which ones yield the best matches. Over time, you'll get a sense of
  which transformations work best for your specific data.

In short, you can use the model to generate more robust queries and fine-tune
from there. It's a mix of automation and a little human tweaking.

JANUARY 1, 2026

Updated storage_exp by running build_exp.py

SEPTEMBER 6, 2025

For ssearch: Rebuilt storage_exp. 2048 embeddings. Took about 4 minutes.

Need to experiment more with query rewrites. Save the query but match on
extracted terms? You can imagine an agent that decides between a search like
grep and a more semantic search.
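A minimal sketch of that grep-vs-semantic dispatch idea (the trigger heuristics here are illustrative guesses, not an implemented script): route queries containing quoted phrases, dates, or proper names to exact-match keyword search, everything else to vector search.

```python
import re

def route_query(query: str) -> str:
    """Crude router: 'keyword' for exact-term-looking queries (grep-style
    search), 'semantic' for everything else (vector search).
    The heuristics are illustrative, not tuned rules."""
    has_quoted = '"' in query
    has_date = re.search(r"\b\d{4}-\d{2}-\d{2}\b", query) is not None
    # two adjacent capitalized words past the first character suggest a
    # proper name (skipping index 0 avoids matching a sentence-initial capital)
    has_name = re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", query[1:]) is not None
    if has_quoted or has_date or has_name:
        return "keyword"
    return "semantic"
```

So "What did the author say on 2024-03-01?" or "Find passages that mention Louis Menand" would go to the keyword path, while a thematic query like "times the author felt conflicted about career choices" would go to the semantic path. A real agent could also run both and merge, as in the BM25 hybrid.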
The search is not good at finding dates ("What did the author say on DATE?") or when searching for certain terms ("What did the author say about libraries?").

AUGUST 28, 2025

Idea: given a strong (or top) hit, use this node to find similar chunks.

Working with demo. I saved 294 emails from president@udel.edu. Embedding these took some time! Nearly 45 minutes. The resulting vector store is larger than my journals. The search is ok, but this could be optimized by stripping the headers.

To make the text files, I used:

    $ textutil -convert txt *.eml

The resulting text is 145204 lines, 335690 words, and 9425696 characters total. About 9.4MB of text.

    $ python build.py
    Parsing nodes: 100%|████| 294/294 [00:31<00:00, 9.28it/s]
    Generating embeddings: 100%|████| 2048/2048 [01:57<00:00, 17.49it/s]
    [... 18 more "Generating embeddings" batches of 2048 ...]
    Generating embeddings: 100%|████| 22/22 [00:02<00:00, 10.44it/s]

Took a Total = 2,571 seconds = 42 minutes 51 seconds.

Results in a vector store:

    $ ls -lh storage/
    total 2014288
    -rw-r--r--@ 1 furst staff 867M Aug 28 10:58 default__vector_store.json
    -rw-r--r--@ 1 furst staff 100M Aug 28 10:57 docstore.json
    -rw-r--r--@ 1 furst staff  18B Aug 28 10:57 graph_store.json
    -rw-r--r--@ 1 furst staff  72B Aug 28 10:58 image__vector_store.json
    -rw-r--r--@ 1 furst staff 3.1M Aug 28 10:57 index_store.json

That's a big vector store! The journals have a vector store that is only 90M (an order of magnitude smaller) from a body of text that is (by wc): 43525 lines, 469246 "words", and 2786938 characters total. (About 3MB of text.)

OK, I had help writing a script to just extract the text (or html) from the eml files. The resulting text files are:

    21313 lines
    130901 words
    946474 characters total

so we've dropped down considerably! Running build.py...

Note that the number of nodes is the number of files.
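The header-stripping pass can be approximated with the standard library's email package. A minimal sketch (not the actual script I used, which may differ):

```python
import email
from email import policy
from pathlib import Path

def eml_body(raw: bytes) -> str:
    """Return the text body of one .eml message, dropping headers and
    non-text MIME parts (the step that shrank the corpus above)."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content() if part is not None else ""

# Convert every .eml in the current directory to a .txt file.
for eml in Path(".").glob("*.eml"):
    eml.with_suffix(".txt").write_text(eml_body(eml.read_bytes()))
```

get_body() with preferencelist prefers the plain-text part of multipart messages, which is most of what quoted-reply-heavy email needs.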
Each node must contain multiple embedded vectors? This time it took about 1:15 to generate the vector store.

    $ ls -lh storage
    total 58608
    -rw-r--r--@ 1 furst staff  25M Aug 28 11:36 default__vector_store.json
    -rw-r--r--@ 1 furst staff 3.5M Aug 28 11:35 docstore.json
    -rw-r--r--@ 1 furst staff  18B Aug 28 11:35 graph_store.json
    -rw-r--r--@ 1 furst staff  72B Aug 28 11:36 image__vector_store.json
    -rw-r--r--@ 1 furst staff  92K Aug 28 11:35 index_store.json

AUGUST 27, 2025

The wrapped query works great on the decwriter! Queries take about 83 seconds, and sometimes up to 95 seconds if the model needs to be loaded. Longest query so far (had to load all models) is 98 seconds.

AUGUST 26, 2025

- Started an "experimental" folder for combining semantic + LLM-guided regex search.
- Created an "archive" folder for older versions.
- Wrote a shell script wrapper and a version that takes input on the command line

Timed the retrieval, but backup was running, so it was probably longer.

    real 1m20.971s
    user 0m13.074s
    sys  0m1.429s

AUGUST 25, 2025

- I want to build a bash wrapper around the python query engine. The bash wrapper would handle input and output.
- Expand the search to extract keywords and do a regex search on those. Can you search the real text chunks and sort by a similarity calc? But similarity with what?
- What if you returned more results and sorted these by a cluster grouping?

AUGUST 21, 2025

HyDE stands for Hypothetical Document Embeddings. I took out HyDE to test generation. Not sure HyDE is doing anything? Indeed, it is not generating results that are any better or different than just using the BAAI/bge-large-en-v1.5 embedding model and a custom prompt. The BAAI/bge-large-en-v1.5 model gives very good results!

Compared llama3.1:8B with command-r7b. Both are about the same size and give similar results. ChatG is pretty adamant that command-r7b will stick more to the retrieved content.
This is actually reinforced by the following exercise:

command-r7b output:

    You are a helpful assistant. Using ONLY the context below, answer the user's question.
    Context:
    - The filing deadline for the 2023 tax year is April 15th, 2024.
    - Late filings are subject to a 5% penalty per month, up to 25%.
    Question: When is the last day I can file my 2023 taxes without penalty?

    The last day you can file your 2023 taxes without incurring any penalties is April 15th, 2024.
    This is the official filing deadline for the 2023 tax year. Filing after this date will result
    in a late fee, with a 5% penalty per month up to a maximum of 25%.

llama3.1:8B output:

    You are a helpful assistant. Using ONLY the context below, answer the user's question.
    Context:
    - The filing deadline for the 2023 tax year is April 15th, 2024.
    - Late filings are subject to a 5% penalty per month, up to 25%.
    Question: When is the last day I can file my 2023 taxes without penalty?

    April 15th, 2024.
    Note: The context only mentions the filing deadline and late fees, not any possible extensions
    or exceptions.

ChatG says: LLaMA 3 8B might answer correctly but add a guess like "extensions are available." Command R 7B is more likely to stay within the context boundaries. Interestingly, this is what we see.

AUGUST 20, 2025

Tried doing a query rewrite, but this is difficult. I reverted back.

I got a pretty good result with this question:

    Enter your question (or 'exit'): What would the author say about art vs. engineering?

**Summary Theme**
The dominant theme from the relevant CONTEXT is that there are valuable connections between artistic and technical pursuits, and that engineers can benefit from embracing multidisciplinary approaches to their work.

**Matching Files**
• **2021-11-24.txt** — This file discusses how the author's choice of pursuing academic administration instead of art raised their esteem but also legitimized their position in a field where they were not recognized for artistic pursuits. It highlights the importance of embracing multidisciplinary approaches.
• **2025-02-10.txt** — This file explicitly states that engineers who cultivate an artistic and humanistic practice often develop improved communication skills, enhanced creativity, increased empathy, and broader perspectives.
• **2024-09-04.txt** — Although this file mentions a science fiction writer's article on AI creating art, it also touches on the idea of artists exploring technological tools to express themselves creatively.
• **2022-01-03.txt** — This file discusses artistic practice through computer programming and coding, highlighting the intersection of art and technology.
• **2024-05-20.txt** — The author reflects on their past experience with art and how it compares to their current analytical work, showing a continued interest in exploring both technical and creative pursuits.
• **2021-11-23.txt** — This file is similar to 2021-11-24.txt but provides more insight into the author's decision-making process regarding artistic versus administrative pursuits.
• **2024-04-28.txt** — The text by Molnar discusses how computers free artists from a sclerotic legacy and enable the production of unimaginable pictures, highlighting the potential for technological tools to enhance artistic expression.
• **2025-02-10.txt** — This file also mentions that engineers who engage with art and humanities develop increased empathy and are more well-rounded individuals, further supporting the theme of interdisciplinary connections.
• **2022-01-01.txt** — The discussion on computer art in New Mexico touches on how technology was used to analyze physical problems using numerical calculations, showing a historical connection between art and technical pursuits.
• **2025-02-10.txt** — This file's fourth point lists "broader perspectives" as one of the benefits engineers gain from exploring various disciplines.
Source documents:

    2025-02-10.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-02-10.txt 0.6828949479536661
    2021-11-24.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-11-24.txt 0.6798298755248631
    2025-02-10.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-02-10.txt 0.661207761457756
    2025-02-10.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-02-10.txt 0.591521102062024
    2024-09-04.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-09-04.txt 0.5321669217118556
    2022-01-03.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2022-01-03.txt 0.5280481502588477
    2024-05-20.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-05-20.txt 0.5234369759566861
    2024-04-28.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-04-28.txt 0.5129689845187633
    2022-01-01.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2022-01-01.txt 0.512206687244374
    2021-11-23.txt --- /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-11-23.txt 0.5086899123840841

A prompt that starts with "What would the author say..." or "What does the author say..." leads to higher similarity scores. Interesting!

I implemented the HyDE rewrite of the prompt and that seems to lead to better results, too.

AUGUST 19, 2025

Started experimenting with chunking. Using 512 tokens and 10 overlap, I get 2412 vectors. Trying 512 tokens and 0 overlap. I changed the paragraph separator to "\n\n". The default is "\n\n\n" for some reason.

This is the prompt I'm using:

    PROMPT = PromptTemplate(
        """You are a research assistant. You're given journal snippets (CONTEXT) and a user query.
    Your job is NOT to write an essay but to list the best-matching journal files with a 1–2
    sentence rationale.

    Rules:
    - Use only the CONTEXT; do not invent content.
    - Prefer precise references to passages over generalities.
    - Output exactly:
      1) A brief one-line summary of the overall theme you detect.
      2) A bulleted list: **filename** — brief rationale. If available in the snippet,
         include date or section hints.

    CONTEXT:
    {context_str}

    QUERY:
    {query_str}

    Now produce the summary and the bulleted list of matching files."""
    )

This one provides better responses:

    PROMPT = PromptTemplate(
        """You are an expert research assistant. You are given top-ranked journal excerpts
    (CONTEXT) and a user's QUERY.

    Instructions:
    - Base your response *only* on the CONTEXT.
    - The snippets are ordered from most to least relevant—prioritize insights from earlier
      (higher-ranked) snippets.
    - Aim to reference *as many distinct* relevant files as possible (up to 10).
    - Do not invent or generalize; refer to specific passages or facts only.
    - If a passage only loosely matches, deprioritize it.

    Format your answer in two parts:
    1. **Summary Theme**
       Summarize the dominant theme from the relevant CONTEXT.
    2. **Matching Files**
       Make a bullet list of 10. The format for each should be:
       • **<filename>** — <brief rationale>

    CONTEXT:
    {context_str}

    QUERY:
    {query_str}

    Now provide the theme and list of matching files."""
    )

I'm reducing chunks to 256 tokens to see if I can get higher similarity scores. It decreased them a bit. Try 384 tokens. Try some overlap, too? I'll do that first.

Troubleshooting: "In part, it may be the low similarity between my query (as query text) and actual chunks or sentences in the text itself."

ChatG response: "You're spot on — if your queries are phrased differently from the way information is expressed in your chunks, that's a core cause of low similarity scores in vector search. This is especially common when:
• The source text is narrative, formal, or unstructured (e.g. journals, reports, logs).
• Queries are abstract or conversational (e.g. "What did we learn about X?" vs. the document saying "We ran A/B tests on X and saw…").
• The embedding model lacks domain-specific nuance.
This is a classic semantic gap problem — and there are several techniques to bridge it."

The 256 and 25 returned pretty good results, if not higher similarity scores. Now I try chunks of 384 and overlap 40. I think the 256 and 25 worked better -- I restored. I will work on the semantic gap with the query. It gives some "advice":
1. reformulate the query with a call to the LLM (something I considered)
2. use a QueryTransformer to expand to related terms or alternate phrasings

---

I switched the embedding model to BAAI/bge-large-en-v1.5. It seems to do better, although it requires more time to embed the vector store. Interestingly, the variance of the embedding values is much lower. The distribution is narrower, although the values skew in a different way. There is a broader distribution of clusters in the vectors.

TRY NEXT:
1. I wonder if the HyDE is doing anything? Should I go back to top-k with the BGE embeddings and see how that works, along with the custom prompt?
2. I'd like to combine a semantic search with another tool that extracts specific search terms and runs a more conventional (if a bit fuzzy) search using those.

AUGUST 17, 2025

I've been working on the jupyter notebook to measure stats of the vector store.

https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/q_and_a/#summarization

I'm going to try and build a summarization query.
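The embedding-distribution comparison noted above (narrower variance with BGE) can be measured straight from the persisted store. A sketch, assuming SimpleVectorStore's JSON layout (an "embedding_dict" mapping node IDs to float lists; worth verifying against your own storage file):

```python
import json
import statistics

def embedding_stats(path):
    """Mean and sample standard deviation over all embedding values in a
    persisted vector store (e.g. storage/default__vector_store.json).
    Assumes the {"embedding_dict": {node_id: [floats]}} layout."""
    with open(path) as f:
        store = json.load(f)
    values = [x for vec in store["embedding_dict"].values() for x in vec]
    return statistics.mean(values), statistics.stdev(values)
```

Running this against the all-mpnet-base-v2 store and the bge-large-en-v1.5 store puts numbers on the "narrower distribution" observation.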
Prompt Chaining - name for thoughts about cleaning / interpreting the prompt

Follow up on llamaindex:
Querying - https://docs.llamaindex.ai/en/stable/understanding/querying/querying/
Indexing - https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/
API reference - https://docs.llamaindex.ai/en/stable/api_reference/

AUGUST 14, 2025

Ideas for the document search / pipeline
- search by cosine similarity for semantic properties
- generate search terms and search by regex -- names, specific topics or words

Problem: Huggingface requires internet connection
Solution: download locally using

    # Install the transformers library if not already installed (EMF: already installed 4.55.0)
    pip install transformers

    # Download the model to a local directory
    mkdir -p /path/to/local/models
    transformers-cli download --model-id sentence-transformers/all-mpnet-base-v2 --local-dir /path/to/local/models

So... huggingface caches the models at ~/.cache/huggingface/hub/ and will redownload them if forced to or if there is a model update.

Throws an error... but I can see the parameters passed to SentenceTransformer:

    --> 160 model = SentenceTransformer(
        161     model_name,
        162     device=device,
        163     cache_folder=cache_folder,
        164     trust_remote_code=trust_remote_code,
        165     prompts={
        166         "query": query_instruction
        167         or get_query_instruct_for_model_name(model_name),
        168         "text": text_instruction
        169         or get_text_instruct_for_model_name(model_name),
        170     },
        171     **model_kwargs,
        172 )

SOLUTION: I ran first and it downloaded to the local directory. Then I could use local_files_only=True and run offline:

    embed_model = HuggingFaceEmbedding(
        cache_folder="./models",
        model_name="all-mpnet-base-v2",
        local_files_only=True,
    )

---

So, for llamaindex:

nodes: are chunks of text (e.g., paragraphs, sentences) extracted from documents. They are the raw text units that are indexed and queried.
Nodes are stored in the document store (e.g., SimpleDocumentStore, ChromaDocumentStore), which keeps track of the original text and metadata.

vector store: stores embeddings of nodes. Each entry in the vector store corresponds to a node's embedding vector. When you query the vector store, the results include node IDs (or metadata) that link back to the original nodes in the document store. Vector store entries are linked to their full content via metadata (e.g., node ID).

AUGUST 12, 2025

- Want to understand the vector store better.
- Is it effective? Are my queries effective?
- How many entries are there in the vector store? What are they?
- Why doesn't it find Katie Hafner, but it does find Jimmy Soni?

Here is one that works. It's not a query that is looking for a specific name, but something that is about a more general concept. But notice that it hallucinates at the end!

    Enter your question (or 'exit'): Where does the text mention the author's interest in the physical sciences?

The text mentions the author's interest in the physical sciences in various places. For example, they express fascination with globular clusters when they were younger. They also discuss their conversation with John Mather about measurements of gravity and time. Additionally, they mention their own work in the field, citing papers and research projects. The author's interest in physics is further highlighted by their reflection on the authors that have profoundly affected their thinking, which includes several physicists like David Graeber and Robert Wright.

AUGUST 10, 2025

Files made today:
build.py
query_topk.py
query.py

Build a semantic search of nd texts.
- Ingest all texts and metadata
- Search and return relevant text and file information

Created .venv environment

    $ python3 -m venv .venv
    $ pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface

Ran build.py successfully and generated store. SimpleDirectoryReader is supposed to store the filename and file path as metadata.

This is an interesting query:

    Enter your question (or 'exit'): What kinds of documents are in the database?

The database appears to contain a wide range of text files, including articles, blog posts, personal notes, and other written content. Some examples include essays about digital preservation and archiving, technical descriptions of old computer hardware, book reviews, and even what seems to be a journal entry or diary passage. The diversity of the texts suggests that they may be collected from various sources, such as online articles, social media posts, personal websites, or even printed materials scanned or photographed for digitization.

This is with llama3.1:8B. With deepseek-r1:8B, the model shows its thinking:

    Enter your question (or 'exit'): What kinds of documents are in the database?

Okay, let me try to figure out how to approach this question based on the provided context. The user is asking about the types of documents stored in a database, but I don't have direct access to their dataset or any prior knowledge beyond what's given.

First, looking through all the context files listed here. There are several entries with different file paths and dates, each containing some text snippets. For example, there's one titled "The Transformissions of the Archive" discussing digital preservation challenges. Another mentions reading books like "The Theory and Craft of Digital Preservation," which likely contains academic documents or articles related to that topic. Then there's a list from 2015 about the digital dark age, comparing physical media versus digital archivability.
There's also personal anecdotes from May 6th, 1998, talking about old hardware and software, possibly including documentation on technology history. Additionally, in the May 10th entry, the person mentions books they've read or are working on, some of which might be available as documents like PDFs or e-books. There's a mention of terms related to services (Terms of Service, Privacy Policy) that could be part of legal documents stored elsewhere. The user also has text for projects, including updated policies and agreements, which again fall under legal documents. The final part talks about personal experiences but doesn't directly reference database content except in the context of "text for projects" being part of a collection they're working on.

Putting this together, without referencing any specific files or parts from the context, I can infer that the database includes various types of documents related to digital preservation theories and practices, possibly academic papers, books, articles. Also, it might contain legal documents like terms of service agreements, privacy policies, etc., and perhaps other technical documentation about technology history mentioned in their projects.

The kinds of documents in the database are those that pertain to the topics discussed or listed above, such as digital preservation theories and practices, academic works on internet pioneers, misinformation age guides, books related to science fiction (like Animorphs), personal anecdotes, project-related texts including terms of service and privacy policy, and other technical or historical documents.

The results of gemma3:1b:

    Enter your question (or 'exit'): What kinds of documents are in the database?
The database contains documents related to various topics, including:

* **Digital Preservation:** Books about digital preservation, including Robert Lucky's "Silicon Dreams" and Matthew Kirschenbaum's "Mechanisms."
* **Computer History:** Documents about the history of computing, such as Trever Owens' "The Theory and Craft of Digital Preservation," David Helfand's "A Survival Guide to the Misinformation Age," and Cailin O'Connor and James Owen Weatherall's "The Misinformation Age."
* **Digital Media & Archiving:** Documents related to the archive of digital media, including the "Long Now Foundation's archive of every Animorphs book" and the "secret glade" mentioned in the context.
* **Music & Audio:** A collection of music, including "The Skinhead Hamlet," "The Anarchists Cookbook," "Live at Third Ear" by Bardo Pond, "Simple Pleasures" by Bobby McFerrin, and a recording of the "Heart of the Ocean" (a pearl).
* **Personal Experiences & Observations:** A range of documents including notes from the author, accounts of a flat, observations about the Delaware's weather, and a description of the lab's state of disrepair.

***

Overall, I can't get past a fairly trivial query engine right now. Keep testing and see what happens. These aren't very powerful models. Perhaps others will give better results?

***

This is an interesting result (below). I increased the top-k to 50. It has much more text to draw from. But what is being returned?

    Enter your question (or 'exit'): What mathematicians are mentioned?

In the context of the new text, which discusses various books and ideas related to computer science, mathematics, and philosophy, I can identify several mathematicians who have contributed to the field. One notable mathematician is John Horton Conway, who made significant contributions to mathematical concepts that have been explored through computational simulations, such as the Game of Life.
Another important figure is David Khan, whose book "The Code Breakers" provides a comprehensive history of cryptology from ancient times to the 20th century.

However, considering the new context and the emphasis on computer science and computational methods, I would like to highlight other mathematicians who have made significant contributions in this area. For instance, researchers such as Alan Turing, who is often considered the father of computer science, and Claude Shannon, who laid the foundation for modern cryptography and information theory.

Moreover, the mention of books and authors in the new text suggests that there may be additional mathematicians or computer scientists who have made significant contributions to the field. For example, the book "Genius at Play" by Siobhan Roberts discusses John Horton Conway's life and work, while "The Code Breakers" by David Khan provides a historical account of cryptology.

Overall, the mathematicians mentioned in this context are related to computer science and computational methods, highlighting their significant contributions to the intersection of mathematics and computer science.
Source documents:

    2024-09-27.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-09-27.txt
    2020-08-16.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2020-08-16.txt
    2024-09-26.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-09-26.txt
    2024-04-25.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-04-25.txt
    2020-08-10.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2020-08-10.txt
    2024-06-23.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-06-23.txt
    2025-07-16.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2025-07-16.txt
    2016-08-21.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2016-08-21.txt
    2023-06-06.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2023-06-06.txt
    2020-08-20.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2020-08-20.txt
    2024-09-28.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-09-28.txt
    2023-01-30.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2023-01-30.txt
    2022-04-17.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2022-04-17.txt
    2016-05-13.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2016-05-13.txt
    2024-09-18.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-09-18.txt
    2024-04-16.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-04-16.txt
    2022-01-01.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2022-01-01.txt
    2012-12-26.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2012-12-26.txt
    2021-04-05.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2021-04-05.txt
    2021-11-24.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2021-11-24.txt
    2021-08-27.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2021-08-27.txt
    2024-11-14.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-11-14.txt
    2024-06-06.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-06-06.txt
    2016-05-29.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2016-05-29.txt
    2022-02-10.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2022-02-10.txt
    2022-04-17.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2022-04-17.txt
    2021-07-22.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2021-07-22.txt
    2021-11-23.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2021-11-23.txt
    2021-12-08.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2021-12-08.txt
    2017-01-18.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2017-01-18.txt
    2024-03-31.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-03-31.txt
    2023-03-09.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2023-03-09.txt
    2025-08-03.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2025-08-03.txt
    2023-07-01.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2023-07-01.txt
    2018-07-13.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2018-07-13.txt
    2008-04-04.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2008-04-04.txt
    2024-04-24.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-04-24.txt
    2024-03-09.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-03-09.txt
    2023-03-09.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2023-03-09.txt
    2020-07-01.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2020-07-01.txt
    2020-05-21.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2020-05-21.txt
    2017-01-09.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2017-01-09.txt
    2025-02-08.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2025-02-08.txt
    2022-06-17.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2022-06-17.txt
    2024-09-23.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-09-23.txt
    2015-10-28.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2015-10-28.txt
    2025-03-26.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2025-03-26.txt
    2024-12-15.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2024-12-15.txt
    2021-07-04.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2021-07-04.txt
    2025-03-20.txt  /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/../text/2025-03-20.txt

***

It is clear that I have to design the query differently for the purpose.

When I tried to implement the SimilarityPostprocessor in the query engine, it didn't return anything. The similarity scores must be very low.

    query_engine = index.as_query_engine(
        similarity_top_k=50,  # high cap
        node_postprocessors=[
            SimilarityPostprocessor(similarity_cutoff=0.78)  # keep only strong hits
        ],
    )

What are they? How is this working?

***

I modified the prompt. gemma3:1b follows the prompt instructions better than llama3.1:8b. But then it breaks? Weird. It's also running longer. Maybe it got corrupted? Should I restart ollama?

I see one of the problems. Sometimes I have more than one model loaded. Use:

    ollama ps
    ollama stop MODEL_NAME

The performance is difficult to tune. Sometimes the models work and sometimes they don't. Perhaps the problem is the model size.

Let's go back and review some use cases in llamaindex:

semantic search
https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/q_and_a/#semantic-search

summarization
https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/q_and_a/#summarization
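A quick way to see why the 0.78 cutoff returned nothing: count how many retrieved scores survive at various thresholds. This sketch uses the re-ranked scores printed in the August 20 "art vs. engineering" run above (a SimilarityPostprocessor with similarity_cutoff keeps nodes scoring at or above the cutoff, so this filter models its behavior):

```python
def survivors(scores, cutoff):
    """Scores that meet the cutoff, i.e. the nodes a similarity
    cutoff filter would keep."""
    return [s for s in scores if s >= cutoff]

# Scores from the August 20 query above (rounded).
scores = [0.6829, 0.6798, 0.6612, 0.5915, 0.5322,
          0.5280, 0.5234, 0.5130, 0.5122, 0.5087]

for cutoff in (0.78, 0.65, 0.50):
    print(cutoff, len(survivors(scores, cutoff)))
# 0.78 keeps nothing, which matches the empty result;
# a cutoff near 0.65 keeps only the top three hits.
```

So the empty responses are expected: even a good run tops out below 0.7, and a useful cutoff for this corpus sits closer to 0.5-0.65 than 0.78.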