# ssearch development log

## Active files (after Feb 27 reorganization)

- `build_store.py` — build/update journal vector store (incremental)
- `query_hybrid.py` — hybrid BM25 + vector query with LLM synthesis
- `retrieve.py` — hybrid verbatim chunk retrieval (no LLM)
- `search_keywords.py` — keyword search via POS-based term extraction
- `run_query.sh` — shell wrapper for interactive querying
- `clippings_search/build_clippings.py` — build/update clippings vector store (ChromaDB)
- `clippings_search/retrieve_clippings.py` — verbatim clippings retrieval
- `deploy_public.sh` — deploy public files to Forgejo

Earlier scripts moved to `archived/`:
`build.py`, `build_exp.py`, `query_topk.py`, `query_catalog.py`, `query_exp.py`,
`query_topk_prompt.py`, `query_topk_prompt_engine.py`, `query_topk_prompt_dw.py`,
`query_rewrite_hyde.py`, `query_multitool.py`, `shared/build.py`, `shared/query.py`,
`vs_metrics.py`, `claude_diagnostic.py`, `query_claude_sonnet.py`, `query_tree.py`,
`query_topk_prompt_engine_v3.py`, `retrieve_raw.py`

## Best configuration

- **Embedding**: BAAI/bge-large-en-v1.5, 256-token chunks, 25-token overlap
- **Re-ranker**: cross-encoder/ms-marco-MiniLM-L-12-v2 (retrieve top-30, re-rank to top-15)
- **LLM**: command-r7b via Ollama (temperature 0.3). OpenAI gpt-4o-mini available as an alternative.
- **Retrieval**: hybrid BM25 + vector, cross-encoder re-ranked

## To do

1. [DONE] Test v3 (cross-encoder re-ranking) and compare results with v2.
   Selected ms-marco-MiniLM-L-12-v2 after testing three models.

2. [DONE] Verbatim retrieval mode (`retrieve_raw.py`). Uses
   `index.as_retriever()` instead of `index.as_query_engine()` to get
   chunks without LLM synthesis. Re-ranks with the same cross-encoder,
   then outputs raw chunk text with metadata and scores.

3. [DONE] Keyword search pipeline (`search_keywords.py`). Extracts
   nouns and adjectives via NLTK POS tagging, then greps data files.
   Complements vector search for exact names, places, dates.

4. [DONE] BM25 hybrid retrieval (sparse + dense). Two scripts:
   `query_hybrid.py` (with LLM synthesis) and `retrieve.py`
   (verbatim chunks, no LLM). Both run BM25 (top-20) and vector (top-20)
   retrievers, merge/deduplicate, then cross-encoder re-rank to top-15.
   Uses llama-index-retrievers-bm25.

5. Explore query expansion (multiple phrasings, merged retrieval)

6. Explore different vector store strategies (database)

7. [DONE] Test ChatGPT API for final LLM generation (instead of local Ollama)

8. [DONE] Remove API key from this file. Moved to `~/.bashrc` as `OPENAI_API_KEY`.

The retrieval pipeline (embedding, vector search, cross-encoder re-ranking)
stays the same. Only the final synthesis LLM changes.

**Steps:**

1. Install the LlamaIndex OpenAI integration:
   ```
   pip install llama-index-llms-openai
   ```
2. Set the API key as an environment variable:
   ```
   export OPENAI_API_KEY="sk-..."
   ```
   (Or store it in a `.env` file and load it with python-dotenv. Do NOT commit
   the key to version control.)
3. In the query script, replace the Ollama LLM with OpenAI:
   ```python
   # Current (local):
   from llama_index.llms.ollama import Ollama
   Settings.llm = Ollama(
       model="command-r7b",
       request_timeout=360.0,
       context_window=8000,
   )

   # New (API):
   from llama_index.llms.openai import OpenAI
   Settings.llm = OpenAI(
       model="gpt-4o-mini",  # or "gpt-4o" for higher quality
       temperature=0.1,
   )
   ```
4. Run the query script as usual. Everything else (embedding model,
   vector store, cross-encoder re-ranker, prompt) is unchanged.
5. Compare output quality and response time against command-r7b.

Models to try: gpt-4o-mini (cheap, fast), gpt-4o (better quality).
The prompt should work without modification since it's model-agnostic —
just context + instructions.

Note: this adds an external API dependency and a per-query cost.
The embedding and re-ranking remain fully local/offline.

API KEY: moved to `~/.bashrc` as `OPENAI_API_KEY` (do not store it in the repo)

**Getting an OpenAI API key:**

1. Go to https://platform.openai.com/ and sign up (or log in).
2. Navigate to API keys: Settings > API keys (or https://platform.openai.com/api-keys).
3. Click "Create new secret key", give it a name, and copy it.
   The key starts with `sk-` and is shown only once.
4. Add billing: Settings > Billing. Load a small amount ($5–10)
   to start. API calls are pay-per-use, not a subscription.
5. Set the key in your shell before running a query:
   ```
   export OPENAI_API_KEY="sk-..."
   ```
   Or add it to `~/.zshrc` (or `~/.bashrc`) to persist across sessions.
   Do NOT commit the key to version control or put it in scripts.

**Approximate cost per query (Feb 2026):**
- gpt-4o-mini: ~$0.001–0.003 (15 chunks of context)
- gpt-4o: ~$0.01–0.03

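The gpt-4o-mini figure can be sanity-checked with quick arithmetic. The token counts and list prices below are illustrative assumptions, not measured values (check current OpenAI pricing):

```python
# Back-of-envelope per-query cost for gpt-4o-mini, assuming list prices of
# $0.15 per 1M input tokens and $0.60 per 1M output tokens (hypothetical
# figures for illustration).

CHUNK_TOKENS = 256      # chunk size used by the pipeline
NUM_CHUNKS = 15         # re-ranked top-15 chunks sent as context
PROMPT_OVERHEAD = 300   # instructions + question (rough guess)
OUTPUT_TOKENS = 600     # typical synthesized answer (rough guess)

input_tokens = CHUNK_TOKENS * NUM_CHUNKS + PROMPT_OVERHEAD
cost = input_tokens / 1e6 * 0.15 + OUTPUT_TOKENS / 1e6 * 0.60

print(f"{input_tokens} input tokens, ~${cost:.4f} per query")
```

This lands at roughly a tenth of a cent per query, consistent with the low end of the estimate above.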
---

## February 27, 2026

### Project reorganization

Reorganized the project structure with Claude Code. Goals: drop legacy version
numbers from filenames, archive superseded scripts, group clippings search into
a subdirectory, and clean up storage directory names.

**Script renames:**
- `build_exp_claude.py` → `build_store.py`
- `query_hybrid_bm25_v4.py` → `query_hybrid.py`
- `retrieve_hybrid_raw.py` → `retrieve.py`

**Archived (moved to `archived/`):**
- `query_topk_prompt_engine_v3.py` — superseded by the hybrid BM25+vector query
- `retrieve_raw.py` — superseded by hybrid retrieval

**Clippings search subdirectory:**
- `build_clippings.py` → `clippings_search/build_clippings.py`
- `retrieve_clippings.py` → `clippings_search/retrieve_clippings.py`
- Scripts use `./` paths relative to the project root, so no path changes were
  needed when run as `python clippings_search/build_clippings.py` from the root.

**Storage renames:**
- `storage_exp/` → `store/` (journal vector store)
- `storage_clippings/` → `clippings_search/store_clippings/` (clippings vector store)
- Deleted unused `storage/` (original August 2025 store, never updated)

**Updated references** in `run_query.sh`, `.gitignore`, `CLAUDE.md`, `README.md`,
and all Python scripts that referenced the old storage paths.

### Deploy script (`deploy_public.sh`)

Created `deploy_public.sh` to automate publishing to Forgejo. Previously,
maintaining the public branch required manually recreating an orphan branch,
copying files, editing the README, and force-pushing — error-prone and tedious.

The script:
1. Checks that we're on `main` with no uncommitted changes
2. Deletes the local public branch and creates a fresh orphan
3. Copies the listed public files from `main` (via `git checkout main -- <file>`)
4. Generates a public README by stripping private sections (Notebooks,
   Development history) and private file references using `awk`
5. Stages only the listed files (not untracked files on disk)
6. Commits with a message and force-pushes to `origin/public`
7. Switches back to `main`

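Step 4 (README generation) can be sketched as a single awk filter. The section names come from the devlog ("Notebooks", "Development history"); the demo README content and the `README.public.md` output name are hypothetical, since the real file list lives in `deploy_public.sh`:

```shell
# Demo input: a README with one private section (illustrative content only).
cat > README.md <<'EOF'
# ssearch

## Usage
Run run_query.sh.

## Notebooks
(private notes)

## License
MIT
EOF

# Drop the "## Notebooks" and "## Development history" sections:
# skip from a private heading up to (not including) the next "## " heading.
awk '
  /^## (Notebooks|Development history)$/ { skip = 1; next }
  /^## /                                 { skip = 0 }
  !skip                                  { print }
' README.md > README.public.md
```

The state variable `skip` turns off at the next `## ` heading, so only the private section's body is removed and everything else passes through untouched.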
Fixed a bug where `git add .` picked up untracked files (`output_test.txt`,
`run_retrieve.sh`). Changed to `git add "${PUBLIC_FILES[@]}" README.md`.

### Forgejo setup

Set up SSH push to the Forgejo instance. This required adding the SSH public
key to the Forgejo user settings. The remote uses a Tailscale address.

### MIT License

Added the MIT License (Copyright (c) 2026 E. M. Furst) to both the main and
public branches.

### Devlog migration

Migrated `devlog.txt` to `devlog.md` with markdown formatting.

---

## February 20, 2026

### Offline use: environment variables must be set before imports

Despite setting `HF_HUB_OFFLINE=1` and `SENTENCE_TRANSFORMERS_HOME=./models`
(added Feb 16), the scripts still failed offline with a `ConnectionError` trying
to reach huggingface.co. The error came from `AutoTokenizer.from_pretrained()`
calling `list_repo_templates()`, which makes an HTTP request to the HuggingFace API.

**Root cause:** the `huggingface_hub` library evaluates `HF_HUB_OFFLINE` at import
time, not at call time. The constant is set once in `huggingface_hub/constants.py`:

```python
HF_HUB_OFFLINE = _is_true(os.environ.get("HF_HUB_OFFLINE")
                          or os.environ.get("TRANSFORMERS_OFFLINE"))
```

In all four scripts, the `os.environ` lines came AFTER the imports:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding  # triggers import of huggingface_hub
from llama_index.core.postprocessor import SentenceTransformerRerank
import os

os.environ["HF_HUB_OFFLINE"] = "1"  # too late, constant already False
```

By the time `os.environ` was set, `huggingface_hub` had already imported and locked
the constant to `False`. The env var existed in the process environment but the
library never re-read it.

**Fix:** moved `import os` and all three `os.environ` calls to the top of each file,
before any llama_index or huggingface imports:

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import ...  # now these see the env vars
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
```

Updated scripts: `query_topk_prompt_engine_v3.py`, `retrieve_raw.py`,
`query_hybrid_bm25_v4.py`, `retrieve_hybrid_raw.py`.

**General lesson for offline HuggingFace use:**

The HuggingFace ecosystem has multiple libraries that check for offline mode:
- `huggingface_hub`: reads `HF_HUB_OFFLINE` (or `TRANSFORMERS_OFFLINE`) at import
- `transformers`: delegates to huggingface_hub's constant
- `sentence-transformers`: delegates to huggingface_hub's constant

All of them evaluate the flag ONCE at module load time. This means:
1. `os.environ` must be set before ANY import that touches `huggingface_hub`
2. Setting the env var in a "Globals" section after the imports does NOT work
3. Even indirect imports count — `llama_index.embeddings.huggingface`
   transitively imports `huggingface_hub`, so the flag must precede it
4. Alternatively, set the env var in the shell before running Python:
   ```bash
   export HF_HUB_OFFLINE=1
   ```
   This always works because it's set before any Python code runs.
5. The newer `transformers` library (v4.50+) added `list_repo_templates()` in
   `AutoTokenizer.from_pretrained()`, which makes network calls that weren't
   present in earlier versions. This is why the Feb 16 fix worked initially
   (or appeared to) but broke after a package update.

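The pitfall is easy to reproduce without HuggingFace at all. A minimal stand-in module (all names here are hypothetical) that reads its flag at import time, the way `huggingface_hub/constants.py` does:

```python
import os
import sys
import tempfile

# A stand-in library that evaluates its offline flag once, at import time.
lib_src = 'import os\nOFFLINE = os.environ.get("FAKE_HUB_OFFLINE") == "1"\n'
libdir = tempfile.mkdtemp()
with open(os.path.join(libdir, "fake_hub.py"), "w") as f:
    f.write(lib_src)
sys.path.insert(0, libdir)

import fake_hub                        # flag is evaluated HERE

os.environ["FAKE_HUB_OFFLINE"] = "1"  # too late: the module never re-reads it
print(fake_hub.OFFLINE)               # prints False
```

Swapping the last two statements (set the env var, then import) flips the result, which is exactly the fix applied to the four scripts.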
This is a common pitfall for anyone running HuggingFace models offline (e.g.,
on a laptop without network, in air-gapped environments, or behind restrictive
firewalls). The models are cached locally and work fine — but the library
still tries to check for updates unless the offline flag is set correctly.

---

### Incremental vector store updates

Added an incremental update mode to `build_store.py` (then `build_exp_claude.py`).
Previously the script rebuilt the entire vector store from scratch on every run
(~1848 files). Now it defaults to incremental mode: it loads the existing index,
compares against `./data`, and only processes new, modified, or deleted files.

**Usage:**
```bash
python build_store.py            # incremental update (default)
python build_store.py --rebuild  # full rebuild from scratch
```

**How it works:**
- The LlamaIndex docstore (`store/docstore.json`) already tracks every
  indexed document with metadata: `file_name`, `file_size`, `last_modified_date`.
- The script scans `./data/*.txt` and classifies each file:
  - **New:** `file_name` not in docstore → insert
  - **Modified:** `file_size` or `last_modified_date` differs → delete + re-insert
  - **Deleted:** in docstore but not on disk → delete
  - **Unchanged:** skip
- Uses `index.insert()` and `index.delete_ref_doc()` from the LlamaIndex API.
- The same `SentenceSplitter` (256 tokens, 25 overlap) is applied via
  `Settings.transformations` so chunks match the original build.

**Timing:** an incremental update with nothing to do takes ~17s (loading the
index). A full rebuild takes several minutes. The first incremental run after a
stale index found 8 new files and 204 modified files and completed in ~65s.

**Important detail:** `SimpleDirectoryReader` converts file timestamps to UTC
(`datetime.fromtimestamp(mtime, tz=timezone.utc)`) before formatting them as
`YYYY-MM-DD`. The comparison logic must use UTC too, or files modified late in
the day will show up as "modified" because the date rolls forward in UTC. This
caused a false-positive bug on the first attempt.

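The mismatch is easy to see near the date boundary. For a hypothetical machine in a UTC-5 zone, a late-evening local mtime already falls on the next UTC day:

```python
from datetime import datetime, timezone, timedelta

# 23:30 local time in a UTC-5 zone is 04:30 the NEXT day in UTC.
local_tz = timezone(timedelta(hours=-5))
mtime = datetime(2026, 2, 19, 23, 30, tzinfo=local_tz).timestamp()

local_date = datetime.fromtimestamp(mtime, tz=local_tz).strftime("%Y-%m-%d")
utc_date = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")

print(local_date, utc_date)  # 2026-02-19 2026-02-20
```

Comparing a local-time date against the docstore's UTC date would flag this file as modified even though nothing changed.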
This enables running the build as a cron job to keep the vector store current
as new journal entries are added.

---

## February 18, 2026

### LLM comparison: gpt-4o-mini (OpenAI API) vs command-r7b (local Ollama)

Test query: "Passages that quote Louis Menand." (hybrid BM25+vector, v4)
Retrieval was identical (same 15 chunks, same scores) — only the synthesis differs.
Results saved in `tests/results_openai.txt` and `tests/results_commandr7b.txt`.

**gpt-4o-mini:**
- Cited 6 files (2025-11-04, 2025-02-14, 2022-08-14, 2025-07-27,
  2025-02-05, 2024-09-04). Drew from chunks ranked as low as #14.
- Better at distinguishing direct quotes from paraphrases and indirect
  references. Provided a structured summary with numbered entries.
- 44 seconds total (most of that is local retrieval/re-ranking; the
  API call itself is nearly instant).

**command-r7b:**
- Cited 2 files (2025-11-04, 2022-08-14). Focused on the top-scored
  chunks and ignored lower-ranked ones.
- Pulled out actual quotes verbatim as block quotes — more useful if
  you want the exact text rather than a summary.
- 78 seconds total.

**Summary:** gpt-4o-mini is broader (more sources, better use of the full
context window) and nearly 2x faster. command-r7b is more focused and
reproduces exact quotes. Both correctly identified the core passages.
The quality difference is noticeable but not dramatic — the retrieval
pipeline does most of the heavy lifting.

### Temperature experiments

The gpt-4o-mini test used temperature=0.1 (nearly deterministic). command-r7b
via Ollama defaults to temperature=0.8 — so the two models were tested at very
different temperatures, which may account for some of the stylistic difference.

**Temperature guidance for RAG synthesis:**

| Range | Behavior | Use case |
|-------|----------|----------|
| 0.0–0.1 | Nearly deterministic. Picks highest-probability tokens. | Factual extraction, consistency. Can "tunnel vision." |
| 0.3–0.5 | Moderate. More varied phrasing, draws connections across chunks. | Good middle ground for RAG (the prompt already constrains context). |
| 0.7–1.0 | Creative/varied. Riskier for RAG — may paraphrase loosely. | Not ideal for faithfulness to source text. |

**Follow-up: temperature=0.3 for both models (same query, same retrieval)**

**command-r7b at 0.3 (was 0.8):** Major improvement. Cited 6 files (was 2).
Drew from lower-ranked chunks, including #15. Used the full context window
instead of fixating on the top hits. Took 94s (was 78s) due to more output.

**gpt-4o-mini at 0.3 (was 0.1):** Nearly identical to the 0.1 run. Same 6 files,
same structure. Slightly more interpretive phrasing but no meaningful
change. This model is less sensitive to temperature for RAG synthesis.

**Key finding:** Temperature is a critical but often overlooked parameter when
evaluating the generation stage of a RAG pipeline. In our tests, a local 7B model
(command-r7b) went from citing 2 sources to 6 — a 3x improvement in context
utilization — simply by lowering the temperature from 0.8 to 0.3. At the higher
temperature, the model "wandered" during generation, fixating on the most salient
chunks and producing repetitive output. At the lower temperature, it methodically
worked through the full context window.

**Implications for RAG evaluation methodology:**
1. When comparing LLMs for RAG synthesis, temperature must be controlled
   across models. Our initial comparison (gpt-4o-mini at 0.1 vs
   command-r7b at its 0.8 default) overstated the quality gap between models.
2. The "right" temperature for RAG is lower than for open-ended generation.
   The prompt and retrieved context already constrain the task; high
   temperature adds noise rather than creativity.
3. Temperature affects context utilization, not just style. A model that
   appears to "ignore" lower-ranked chunks may simply need a lower
   temperature to attend to them.
4. At temperature=0.3, a local 7B model and a cloud API model converged
   on similar quality (6 files cited, good coverage, a mix of quotes and
   paraphrase). The retrieval pipeline does most of the heavy lifting;
   the generation model's job is to faithfully synthesize what was retrieved.

**Testing method:** Hold retrieval constant (same query, same vector store,
same re-ranker, same top-15 chunks). Vary only the LLM and temperature.
Compare on: number of source files cited, whether lower-ranked chunks
are used, faithfulness to source text, and total query time. Results are
saved in `tests/` with the naming convention `results_<model>_t<temp>.txt`.

---

### LlamaIndex upgrade to 0.14.14

Upgraded LlamaIndex from 0.13.1 to 0.14.14 to add OpenAI API support.

Installing `llama-index-llms-openai` pulled in `llama-index-core` 0.14.14, which
was incompatible with the existing companion packages (all pinned to <0.14).
Fixed by upgrading all companion packages together:

```bash
pip install --upgrade llama-index-embeddings-huggingface \
    llama-index-readers-file llama-index-llms-ollama \
    llama-index-retrievers-bm25
```

**Final package versions:**

| Package | Version | Was |
|---------|---------|-----|
| llama-index-core | 0.14.14 | 0.13.1 |
| llama-index-embeddings-huggingface | 0.6.1 | 0.6.0 |
| llama-index-llms-ollama | 0.9.1 | 0.7.0 |
| llama-index-llms-openai | 0.6.18 | new |
| llama-index-readers-file | 0.5.6 | 0.5.0 |
| llama-index-retrievers-bm25 | 0.6.5 | unchanged |
| llama-index-workflows | 2.14.2 | 1.3.0 |

Smoke test: `retrieve_raw.py "mining towns"` — works, same results as before.
No vector store rebuild was needed. The existing store loaded fine with 0.14.

---

### Paragraph separator validation

Checked whether `paragraph_separator="\n\n"` in `build_store.py` makes sense
for the journal data.

Results from scanning all 1,846 files in `./data/`:
- 1,796 files (97%) use `\n\n` as paragraph boundaries
- 28 files use single newlines only
- 22 files have no newlines at all
- Average paragraphs per file: 10.8 (median 7, range 0–206)
- 900 files (49%) also use `---` as a topic/section separator

The `\n\n` setting is correct. `SentenceSplitter` tries to break at
`paragraph_separator` boundaries first, then falls back to sentence boundaries,
then words. With 256-token chunks, this keeps semantically related sentences
together within a paragraph.

The `---` separators are already surrounded by `\n\n` (e.g., `\n\n---\n\n`), so
they naturally act as break points too. No special handling is needed.

Note: `"\n\n"` is actually the default value for `paragraph_separator` in
LlamaIndex's `SentenceSplitter`. The explicit setting documents intent but is
functionally redundant.

List-style entries with single newlines between items (e.g., `2001-09-14.txt`)
stay together within a chunk, which is desirable — lists shouldn't be split
line by line.

---

## February 16, 2026

### Cross-encoder model caching for offline use

Cached the cross-encoder model (`cross-encoder/ms-marco-MiniLM-L-12-v2`) in
`./models/` for offline use. Previously, `HuggingFaceEmbedding` already used
`cache_folder="./models"` with `local_files_only=True` for the embedding model,
but the cross-encoder (loaded via `SentenceTransformerRerank` → `CrossEncoder`)
had no `cache_folder` parameter and would fail offline when it tried to phone
home for updates.

**Fix:** all scripts that use the cross-encoder now set two environment variables
before model initialization:
```python
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
```

`SENTENCE_TRANSFORMERS_HOME` directs the `CrossEncoder` to look in `./models/`
for cached weights. `HF_HUB_OFFLINE` prevents any attempt at network access.

The model was cached using `huggingface_hub.snapshot_download()`:
```python
from huggingface_hub import snapshot_download
snapshot_download('cross-encoder/ms-marco-MiniLM-L-12-v2', cache_dir='./models')
```

**Models now in `./models/`:**
- `models--BAAI--bge-large-en-v1.5` (embedding, bi-encoder)
- `models--cross-encoder--ms-marco-MiniLM-L-12-v2` (re-ranker, cross-encoder)
- `models--sentence-transformers--all-mpnet-base-v2` (old embedding, kept)

---

## February 15, 2026

### Design note on `search_keywords.py`

The POS tagger has a fundamental limitation: it was trained on declarative
prose, not imperative queries. A query like "Find passages that mention Louis
Menand" causes the tagger to classify "find" and "mention" as nouns (NN)
rather than verbs, because the imperative sentence structure is unusual in
its training data. This floods the results with false positives (304 matches
across 218 files instead of the handful mentioning Menand).

More fundamentally: for term-based searches, the POS tagging layer adds
minimal value over bare grep. If the input is "Louis Menand", POS tagging
extracts "louis menand" — identical to what grep would match. The tool's
real value is not the NLP layer but the convenience wrapper: searching all
files at once, joining multi-word proper nouns, sorting by match count, and
showing context around matches. It's essentially a formatted multi-file grep.

Possible future direction: merge keyword search results with semantic search
results. The keyword pipeline catches exact names, places, and dates that
embeddings miss, while the semantic pipeline catches thematic relevance that
keywords miss. A hybrid approach could combine both result sets, using keyword
matches to boost or supplement vector retrieval. This connects to the BM25
hybrid retrieval idea (to-do item 4).

### New scripts: `query_hybrid_bm25_v4.py` and `retrieve_hybrid_raw.py`

Implemented BM25 hybrid retrieval (to-do item 4). Both scripts run two
retrievers in parallel on the same query:
- **Vector retriever:** top-20 by cosine similarity (semantic meaning)
- **BM25 retriever:** top-20 by term frequency (exact lexical matching)

Results are merged and deduplicated by node ID, then passed to the
cross-encoder re-ranker (`ms-marco-MiniLM-L-12-v2`) → top-15.

`query_hybrid_bm25_v4.py` feeds the re-ranked chunks to the LLM (same v3
prompt and command-r7b model). `retrieve_hybrid_raw.py` outputs the raw
chunks with source annotations: `[vector-only]`, `[bm25-only]`, or
`[vector+bm25]`, showing which retriever nominated each result.

The BM25 retriever uses `BM25Retriever.from_defaults(index=index)` from
`llama-index-retrievers-bm25` (v0.6.5). It indexes the nodes already
stored in the persisted vector store — no separate build step needed.

**Key idea:** BM25's job is only to nominate candidates that vector similarity
might miss (exact names, dates, specific terms). The cross-encoder decides
final relevance regardless of where the candidates came from.

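The merge/dedup/re-rank flow can be sketched in plain Python with stand-in retriever and scorer callables (the real scripts use LlamaIndex retrievers and `SentenceTransformerRerank`; everything here is illustrative):

```python
def hybrid_retrieve(query, vector_retrieve, bm25_retrieve, rerank_score, top_n=15):
    """Union two candidate lists by node ID, then let the cross-encoder rank.

    vector_retrieve / bm25_retrieve: query -> list of (node_id, text) pairs.
    rerank_score: (query, text) -> relevance score (higher is better).
    """
    candidates = {}  # node_id -> (text, source tag)
    for node_id, text in vector_retrieve(query):
        candidates[node_id] = (text, "vector-only")
    for node_id, text in bm25_retrieve(query):
        if node_id in candidates:
            candidates[node_id] = (candidates[node_id][0], "vector+bm25")
        else:
            candidates[node_id] = (text, "bm25-only")

    # Final order comes from the re-ranker, regardless of which
    # retriever nominated each candidate.
    ranked = sorted(
        ((rerank_score(query, text), node_id, text, tag)
         for node_id, (text, tag) in candidates.items()),
        reverse=True,
    )
    return ranked[:top_n]
```

The dict keyed by node ID performs the deduplication for free, and the source tag records the `[vector-only]` / `[bm25-only]` / `[vector+bm25]` annotation that `retrieve_hybrid_raw.py` prints.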
---

## February 12, 2026

Updated the vector store, now 4,816 chunks.

Scope of a language-model-based search: LLMs can summarize, but lack the
ability to critically read and compare information. ChatGPT can summarize
the literature that I've cited, but it cannot critique it. (It could
generate from published critiques.) Our ability to critically read and
synthesize from the literature is an important skill. (Most reviews fall far
short, simply aggregating "advances" without asking why, how, or whether
they are real or not.)

## February 11, 2026

### Project tidy-up and cross-encoder re-ranking (v3)

Tidied up the project with Claude Code:
- Generated `README.md` and `CLAUDE.md` documentation
- Archived superseded scripts (v1 query engines, old build scripts, `shared/`,
  `experimental/query_multitool.py`)
- Removed a stale `storage_exp` copy (Aug 2025 backup, ~105 MB)
- Removed the empty `shared/` and `experimental/` directories

Created `query_topk_prompt_engine_v3.py`: adds cross-encoder re-ranking.

**The idea:** the current pipeline (v2) uses a bi-encoder (`BAAI/bge-large-en-v1.5`)
that encodes the query and chunks independently, then compares them via cosine
similarity. This is fast but approximate — the query and chunk never "see" each other.

A cross-encoder takes the query and chunk as a single concatenated input, with
full attention between all tokens. It scores the pair jointly, which captures
nuance that dot-product similarity misses (paraphrase, negation, indirect
relevance). The tradeoff is speed: you can't pre-compute the scores.

**v3 uses a two-stage approach:**
1. Retrieve top-30 via the bi-encoder (fast, approximate)
2. Re-rank to top-15 with the cross-encoder (slow, precise)
3. Pass the re-ranked chunks to the LLM for synthesis

Cross-encoder model: `cross-encoder/ms-marco-MiniLM-L-6-v2` (~80 MB, 6 layers).
Trained on MS MARCO passage ranking. Should add only a few seconds to query time
for 30 candidates.

### Bi-encoder vs cross-encoder

**Bi-encoder (what the pipeline had):**
The embedding model (`BAAI/bge-large-en-v1.5`) encodes the query and each chunk
independently into vectors. Similarity is a dot product between two vectors that
were computed separately. This is fast — you can pre-compute all chunk vectors
once at build time and just compare against the query vector at search time. But
because the query and chunk never "see" each other during encoding, the model can
miss subtle relevance signals.

**Cross-encoder (what v3 adds):**
A cross-encoder takes the query and a chunk as a single input pair:
`[query, chunk]` concatenated together. It reads both simultaneously through the
transformer, with full attention between every token in the query and every token
in the chunk. It outputs a single relevance score. This is much more accurate
because the model can reason about the specific relationship between your question
and the passage — word overlap, paraphrase, negation, context.

The tradeoff: it's slow. You can't pre-compute anything because the score depends
on the specific query. Scoring 4,692 chunks this way would take too long.

**Why the two-stage approach works:**
```
4,692 chunks → bi-encoder (fast, approximate) → top 30
top 30 → cross-encoder (slow, precise) → top 15
top 15 → LLM synthesis → response
```

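The funnel above fits in a few lines. The two scoring callables below are placeholders for the bi-encoder cosine similarity and the cross-encoder logit (a sketch, not the v3 script):

```python
def two_stage_search(query, chunks, bi_score, cross_score,
                     first_k=30, final_k=15):
    """Stage 1: cheap bi-encoder scores over ALL chunks -> top first_k.
    Stage 2: expensive cross-encoder scores over the survivors -> top final_k.

    bi_score(query, chunk) and cross_score(query, chunk) are stand-ins for
    the bi-encoder and cross-encoder; higher scores mean more relevant.
    """
    shortlist = sorted(chunks, key=lambda c: bi_score(query, c),
                       reverse=True)[:first_k]
    return sorted(shortlist, key=lambda c: cross_score(query, c),
                  reverse=True)[:final_k]
```

The expensive scorer only ever sees `first_k` candidates, so the cross-encoder cost is fixed per query no matter how large the store grows.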
**Concrete example:** If you search "times the author felt conflicted about career
|
||
choices," the bi-encoder might rank a chunk about "job satisfaction" highly because
|
||
the vectors are close. But a chunk that says "I couldn't decide whether to stay or
|
||
leave" — without using the word "career" — might score lower in vector space. The
|
||
cross-encoder, reading both query and chunk together, would recognize that "couldn't
|
||
decide whether to stay or leave" is highly relevant to "felt conflicted about career
|
||
choices."
|
||
|
||
### Prompt update for v3
|
||
|
||
Updated the v3 prompt to account for re-ranked context. Changes:
|
||
- Tells the LLM the context is from a "personal journal collection" and has been
|
||
"selected and ranked for relevance"
|
||
- "Examine ALL provided excerpts, not just the top few" — counters single-file
|
||
collapse seen in initial testing
|
||
- "When multiple files touch on the query, note what each one contributes" —
|
||
encourages breadth across sources
|
||
- "End with a list of all files that contributed" — stronger than v2's vague
|
||
"list all relevant source files"
|
||
|
||
Also updated `run_query.sh` to point to v3.
|
||
|
||
### v3 test results
|
||
|
||
**Query: "Passages that describe mining towns."**
|
||
- Response cited 2 passages from `2023-03-15.txt` (coal mining, great-grandfather)
|
||
- Source documents included 7 distinct files across 15 chunks
|
||
- Top cross-encoder score: -1.177 (`2025-09-14.txt`)
|
||
- LLM focused on `2023-03-15.txt` which had the most explicit mining content
|
||
- Query time: 76 seconds
|
||
- Note: cross-encoder scores are raw logits (negative), not 0–1 cosine similarity
|
||
|
||
**Query: "I am looking for entries that discuss memes and cognition."**
|
||
- Response cited 6 distinct files with specific content from each:
|
||
`2025-07-14` (Dennett/Blackmore on memes), `2023-09-20` (Hurley model),
|
||
`2024-03-24` (multiple drafts model), `2021-04-25` (consciousness discussion),
|
||
`2026-01-08` (epistemological frameworks), `2025-03-10` (Extended Mind Theory)
|
||
- Top cross-encoder score: 4.499 (`2026-01-08.txt`) — clear separation from rest
|
||
- LLM drew from chunks ranked 3rd, 4th, 5th, 12th, and 15th — confirming it
|
||
examines the full context, not just top hits
|
||
- Query time: 71 seconds
|
||
|
||
**Observations:**
|
||
- The v3 prompt produces much better multi-source synthesis than v2's prompt
|
||
- Cross-encoder scores show clear separation between strong and weak matches
|
||
- The re-ranker + new prompt together encourage breadth across files
|
||
- Query time comparable to v2 (~70–80 seconds)

### Cross-encoder model comparison

Tested three cross-encoder models on the same query ("Discussions of Kondiaronk
and the Wendats") to compare re-ranking behavior.

**1. cross-encoder/ms-marco-MiniLM-L-12-v2 (baseline)**
- Scores: raw logits, wide spread (top score 3.702)
- Clear separation between strong and weak matches
- Balanced ranking: `2025-06-07.txt` #1, `2025-07-28.txt` #2, `2024-12-25.txt` #3
- Query time: ~70–80 seconds
- Trained on MS MARCO passage ranking (query → relevant passage)

**2. cross-encoder/stsb-roberta-base**
- Scores: 0.308 to 0.507 — very compressed range (0.199 spread)
- Poor differentiation: the model can't clearly separate relevant from irrelevant
- Pulled in `2019-07-03.txt` at #2 (not in the L-12 results), dropped `2024-12-25.txt`
- Query time: 92 seconds
- Trained on the STS Benchmark (semantic similarity, not passage ranking) —
  the wrong task for re-ranking. It measures "are these texts about the same
  thing?" rather than "is this passage a good answer to this query?"

**3. BAAI/bge-reranker-v2-m3**
- Scores: calibrated probabilities (0–1). Sharp top (0.812), then 0.313, 0.262…
  Bottom 6 chunks at 0.001 (the model says: not relevant at all)
- Very confident about #1 (`2025-07-28.txt` at 0.812), but a long zero tail
- 5 of 15 chunks from `2025-07-28.txt` — heavy concentration on one file
- Query time: 125 seconds (roughly 50% slower than L-12)
- Multilingual model, larger than the ms-marco MiniLM variants

**Summary:**

| Model | Score spread | Speed | Differentiation |
|-------|-------------|-------|-----------------|
| ms-marco-MiniLM-L-12-v2 | Wide (logits) | ~70–80s | Good, balanced |
| BAAI/bge-reranker-v2-m3 | Sharp top/zeros | ~125s | Confident #1, weak tail |
| stsb-roberta-base | Compressed | ~92s | Poor |

**Decision:** ms-marco-MiniLM-L-12-v2 is the best fit. It is purpose-built for
passage ranking, the fastest of the three, and produces balanced rankings with
good score separation. The BAAI model's zero-tail problem means 6 of 15 chunks
are dead weight in the context window (this could be mitigated by lowering
`RERANK_TOP_N` or adding a score cutoff, but that adds complexity for marginal
gain). The stsb model is simply wrong for this task — semantic similarity ≠
passage relevance.
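
The score-cutoff mitigation can be sketched in a few lines. This is an
illustration, not code from the actual scripts: the function name
`filter_reranked` and the sample scores are made up (the scores are shaped
like the bge-reranker output above).

```python
def filter_reranked(scored_chunks, top_n=15, cutoff=0.01):
    """Keep at most top_n chunks, dropping any whose re-ranker
    score falls below the cutoff (the 'zero tail')."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:top_n] if score >= cutoff]

# A sharp top, a few mid-range hits, then a near-zero tail:
chunks = [("a", 0.812), ("b", 0.313), ("c", 0.262), ("d", 0.001), ("e", 0.001)]
print(filter_reranked(chunks, top_n=15, cutoff=0.01))
```

Note this only makes sense for calibrated 0–1 scores; with raw logits (as
from the ms-marco models) a fixed cutoff would have no natural interpretation.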

### New scripts: `retrieve_raw.py` and `search_keywords.py`

**`retrieve_raw.py`** — Verbatim chunk retrieval, no LLM. Uses the LlamaIndex
retriever API instead of the query engine:

```python
# v3 uses as_query_engine() — full pipeline including LLM synthesis
query_engine = index.as_query_engine(
    similarity_top_k=30,
    text_qa_template=PROMPT,
    node_postprocessors=[reranker],
)
response = query_engine.query(q)  # returns LLM-generated text

# retrieve_raw.py uses as_retriever() — stops after retrieval
retriever = index.as_retriever(similarity_top_k=30)
nodes = retriever.retrieve(q)  # returns raw NodeWithScore objects
reranked = reranker.postprocess_nodes(nodes, query_str=q)
```

The key distinction: `as_query_engine()` wraps retrieval + synthesis into one
call (retriever → node postprocessors → response synthesizer → LLM).
`as_retriever()` returns just the retriever component, giving back the raw
nodes with their text and metadata. The re-ranker's `postprocess_nodes()`
method can still be called manually on the retrieved nodes.

Each node has:
- `node.get_content()` — the chunk text
- `node.metadata` — dict with `file_name`, `file_path`, etc.
- `node.score` — similarity or re-ranker score

This separation is useful for inspecting what the pipeline retrieves before
the LLM processes it, and for building alternative output formats.
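
One possible output loop over those node fields, sketched without LlamaIndex
so it runs standalone: `format_chunks` and the sample data are illustrative,
not the script's actual formatting.

```python
def format_chunks(chunks, width=70):
    """Render (file_name, score, text) triples the way a verbatim
    retrieval script might: a rank/source/score header, then the chunk."""
    lines = []
    for rank, (file_name, score, text) in enumerate(chunks, start=1):
        lines.append(f"[{rank}] {file_name}  (score {score:.3f})")
        lines.append(text[:width])
        lines.append("")
    return "\n".join(lines)

sample = [("2025-07-28.txt", 0.812, "Kondiaronk spoke at length that evening.")]
print(format_chunks(sample))
```

With real nodes, each triple would come from
`(n.metadata["file_name"], n.score, n.get_content())`.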

**`search_keywords.py`** — Keyword search via NLTK POS tagging. Completely
separate from the vector store pipeline. Extracts nouns (NN, NNS, NNP, NNPS)
and adjectives (JJ, JJR, JJS) from the query using `nltk.pos_tag()`, then
searches `./data/*.txt` with regex. Catches exact terms that embeddings miss.
NLTK data (`punkt_tab`, `averaged_perceptron_tagger_eng`) is auto-downloaded
on first run.
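
The regex half of that pipeline can be sketched independently of NLTK.
`search_texts` and the tiny in-memory corpus are illustrative stand-ins for
the script's loop over `./data/*.txt`:

```python
import re

def search_texts(terms, corpus):
    """Return {filename: [matching lines]} for any of the extracted terms,
    matched case-insensitively on word boundaries (as a grep would)."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    hits = {}
    for name, text in corpus.items():
        matching = [line for line in text.splitlines() if pattern.search(line)]
        if matching:
            hits[name] = matching
    return hits

corpus = {
    "2023-03-15.txt": "The coal mining town sat in the valley.\nNothing else.",
    "2024-01-01.txt": "No relevant terms here.",
}
print(search_texts(["mining", "coal"], corpus))
```

Escaping each term with `re.escape` keeps POS-extracted words containing
punctuation from being misread as regex syntax.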

---

## January 12, 2026

### Best practices for query rewriting

1. **Understand the original intent:** Clarify the core intent behind the query.
   Sometimes that means expanding a terse question into a more descriptive one,
   or breaking a complex query into smaller, more focused sub-queries.

2. **Leverage LlamaIndex's built-in rewriting tools:** LlamaIndex has query
   transformation utilities that can help automatically rephrase or enrich
   queries. Use them as a starting point and tweak the results.

3. **Use a model to generate rewrites:** Have a language model generate a
   "clarified" version of the query. Feed the model the initial query and
   ask it to rephrase or add context.

**Step-by-step approach:**
- **Initial query expansion:** Take the raw user query and expand it with
  natural language context.
- **Model-assisted rewriting:** Use a model to generate alternate phrasings.
  Prompt with something like, "Please rewrite this query in a more detailed
  form for better retrieval results."
- **Testing and iteration:** Test the rewritten versions and see which yield
  the best matches.
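
The model-assisted step reduces to assembling a rewrite instruction and
sending it to a local LLM (e.g., via Ollama). A minimal sketch of just the
prompt assembly; the function name and wording are illustrative:

```python
def build_rewrite_prompt(query):
    """Assemble the rewrite instruction that would be sent to the LLM.
    The actual model call (Ollama, OpenAI, ...) happens elsewhere."""
    return (
        "Please rewrite this query in a more detailed form for better "
        f"retrieval results.\n\nQuery: {query}\n\nRewritten query:"
    )

print(build_rewrite_prompt("mining towns"))
```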

---

## January 1, 2026

Updated `storage_exp` by running `build_exp.py`.

---

## September 6, 2025

Rebuilt `storage_exp`: 2048 embeddings. Took about 4 minutes.

Need to experiment more with query rewrites. Save the query but match on
extracted terms? You can imagine an agent that decides between a grep-like
search and a more semantic search. The search is not good at finding dates
("What did the author say on DATE") or certain terms
("What did the author say about libraries?").

---

## August 28, 2025

### Email embedding experiment

Idea: given a strong (or top) hit, use that node to find similar chunks.

Working with a demo. Saved 294 emails from `president@udel.edu`. Embedding
these took nearly 45 minutes. The resulting vector store is larger than the
journals'. The search is OK, but could be optimized by stripping the headers.

To make the text files:
```bash
textutil -convert txt *.eml
```

The resulting text: 145,204 lines, 335,690 words, 9,425,696 characters total
(~9.4 MB of text).

```
$ python build.py
Parsing nodes: 100%|████████| 294/294 [00:31<00:00, 9.28it/s]
Generating embeddings: ... (19 batches of 2048)

Total = 2,571 seconds = 42 minutes 51 seconds.
```

Vector store size:
```
$ ls -lh storage/
-rw-r--r-- 867M default__vector_store.json
-rw-r--r-- 100M docstore.json
-rw-r--r--  18B graph_store.json
-rw-r--r--  72B image__vector_store.json
-rw-r--r-- 3.1M index_store.json
```

That's a big vector store! The journals have a vector store that is only 90M
(an order of magnitude smaller) from a body of text that is ~3 MB.

After extracting just the text/html from the eml files: 21,313 lines,
130,901 words, 946,474 characters total — much smaller. Build time dropped
to ~1:15. Store size dropped to ~25 MB.
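
The header-stripping step can be sketched with the standard library's `email`
parser. This is a minimal illustration, not the extraction actually used:
`extract_plain_text` and the sample message are made up.

```python
from email import message_from_string

def extract_plain_text(raw_eml):
    """Pull only the text/plain parts out of an .eml message,
    dropping headers, HTML alternatives, and attachments."""
    msg = message_from_string(raw_eml)
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)

sample = (
    "From: president@udel.edu\r\n"
    "Subject: Update\r\n"
    "Content-Type: text/plain; charset=utf-8\r\n"
    "\r\n"
    "Meeting moved to Friday.\r\n"
)
print(extract_plain_text(sample))
```

Running something like this over the corpus is what shrank the input from
~9.4 MB to under 1 MB.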

---

## August 27, 2025

The wrapped query works great on the decwriter! Queries take about 83 seconds,
and sometimes up to 95 seconds if the model needs to be loaded. The longest
query so far (it had to load all the models) took 98 seconds.

---

## August 26, 2025

- Started an "experimental" folder for combining semantic + LLM-guided regex search.
- Created an "archive" folder for older versions.
- Wrote a shell script wrapper and a version that takes input on the command line.

Timed the retrieval (a backup was running, so probably longer than usual):
```
real 1m20.971s
user 0m13.074s
sys  0m1.429s
```

---

## August 25, 2025

- Build a bash wrapper around the Python query engine. The bash wrapper would
  handle input and output.
- Expand the search to extract keywords and do a regex search on those. Can you
  search the real text chunks and sort by a similarity calculation?
- What if you returned more results and sorted these by a cluster grouping?

---

## August 21, 2025

### HyDE experiments

HyDE stands for Hypothetical Document Embeddings.

Took HyDE out to test generation. Not sure HyDE is doing anything. Indeed, it is
not generating results that are any better or different than just using the
`BAAI/bge-large-en-v1.5` embedding model and a custom prompt. The BAAI/bge model
gives very good results!

**Compared llama3.1:8B with command-r7b.** Both are about the same size and give
similar results. ChatGPT is pretty adamant that command-r7b will stick more
closely to the retrieved content. This is reinforced by the following exercise:

**command-r7b output** (RAG faithfulness test):
> The last day you can file your 2023 taxes without incurring any penalties is
> April 15th, 2024. This is the official filing deadline for the 2023 tax year.
> Filing after this date will result in a late fee, with a 5% penalty per month
> up to a maximum of 25%.

**llama3.1:8B output:**
> April 15th, 2024.
>
> Note: The context only mentions the filing deadline and late fees, not any
> possible extensions or exceptions.

ChatGPT says: LLaMA 3 8B might answer correctly but add a guess like "extensions
are available." Command R 7B is more likely to stay within the context boundaries.
This is what we see.

---

## August 20, 2025

### Prompt engineering

Tried doing a query rewrite, but this is difficult. Reverted. Got a pretty
good result with this question:

> "What would the author say about art vs. engineering?"

A prompt that starts with "What would the author say..." or "What does the author
say..." leads to higher similarity scores.

Implemented the HyDE rewrite of the prompt, and that seems to lead to better
results, too.

### Prompt comparison

First prompt (research assistant, bulleted list):
```
"""You are a research assistant. You're given journal snippets (CONTEXT) and
a user query. Your job is NOT to write an essay but to list the best-matching
journal files with a 1–2 sentence rationale. ..."""
```

Second prompt (expert research assistant, theme + 10 files):
```
"""You are an expert research assistant. You are given top-ranked journal
excerpts (CONTEXT) and a user's QUERY. ... Format your answer in two parts:
1. Summary Theme 2. Matching Files (bullet list of 10)..."""
```

The second prompt produces better responses.

### Chunk size experiments

Experimenting with chunking. Using 512 tokens and 10 overlap: 2412 vectors.
Tried 512 tokens and 0 overlap. Changed the paragraph separator to `"\n\n"`;
the default is `"\n\n\n"` for some reason.

Reduced chunks to 256 tokens to see if higher similarity scores result. It
decreased them a bit. Tried 384 tokens and 40 overlap. The 256/25 combination
worked better — restored it. Will work on the semantic gap with the query.
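
A quick way to reason about these settings: with chunk size `c` tokens and
overlap `o`, each chunk after the first advances by `c - o` tokens, so a
document of `n` tokens yields roughly `1 + ceil((n - c) / (c - o))` chunks.
A sketch (the token counts are made up for illustration):

```python
import math

def estimated_chunks(n_tokens, chunk_size, overlap):
    """Approximate chunk count for a fixed-size splitter with overlap.
    Each chunk after the first advances by (chunk_size - overlap) tokens."""
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((n_tokens - chunk_size) / step)

# Halving the chunk size roughly doubles the vector count for the same corpus.
print(estimated_chunks(10_000, 512, 10))
print(estimated_chunks(10_000, 256, 25))
```

This is why dropping from 512/10 to 256/25 chunking noticeably grows the
vector store for the same body of text.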

### Embedding model switch

Switched the embedding model to `BAAI/bge-large-en-v1.5`. It seems to do
better, although it requires more time to embed the vector store.
Interestingly, the variance of the embedding values is much lower. The
distribution is narrower, although the values skew in a different way. There
is a broader distribution of clusters in the vectors.

---

## August 17, 2025

Working on the Jupyter notebook to measure stats of the vector store.

Links:
- [Summarization](https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/q_and_a/#summarization)
- [Querying](https://docs.llamaindex.ai/en/stable/understanding/querying/querying/)
- [Indexing](https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing/)
- [API Reference](https://docs.llamaindex.ai/en/stable/api_reference/)

---

## August 14, 2025

Ideas for the document search pipeline:
- Search by cosine similarity for semantic properties
- Generate search terms and search by regex — names, specific topics or words

**Problem:** HuggingFace requires an internet connection.
**Solution:** download the models locally.

HuggingFace caches models at `~/.cache/huggingface/hub/`. It will redownload
them if forced to, or if there is a model update.

Ran the build once online, which downloaded the model to the local directory.
Then used `local_files_only=True` to run offline:
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    cache_folder="./models",
    model_name="all-mpnet-base-v2",
    local_files_only=True,
)
```

### LlamaIndex concepts

- **Nodes:** chunks of text (paragraphs, sentences) extracted from documents.
  Stored in the document store (e.g., `SimpleDocumentStore`), which keeps
  track of the original text and metadata.
- **Vector store:** stores embeddings of nodes. Each entry corresponds to a
  node's embedding vector. Query results include node IDs (or metadata)
  that link back to the original nodes in the document store.
- Vector store entries are linked to their full content via metadata (e.g., node ID).

---

## August 12, 2025

Want to understand the vector store better:
- Is it effective? Are queries effective?
- How many entries are there?
- Why doesn't it find Katie Hafner, but it does find Jimmy Soni?

Query results are improved with a better prompt. Increased top-k to 50 to
give the model more text to draw from. But it hallucinates at the end of
longer responses.

The `SimilarityPostprocessor` with `similarity_cutoff=0.78` returned nothing.
The similarity scores must be very low.
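
For context, the cosine similarity these cutoffs filter on is just a
normalized dot product. A sketch with toy vectors (not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Near-parallel vectors score close to 1; orthogonal vectors score 0.
# A cutoff like 0.78 only passes quite close matches.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

If typical scores against the store sit well below 0.78, the postprocessor
will filter everything, which matches the empty result above.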

Performance is difficult to tune. Sometimes the models work and sometimes
they don't. Having multiple models loaded simultaneously causes issues — use
`ollama ps` to check and `ollama stop MODEL_NAME` to unload.

---

## August 10, 2025

### Project start

Files made today: `build.py`, `query_topk.py`, `query.py`.

Build a semantic search of journal texts:
- Ingest all texts and metadata
- Search and return relevant text and file information

Created a `.venv` environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install llama-index-core llama-index-readers-file \
    llama-index-llms-ollama llama-index-embeddings-huggingface
```

Ran `build.py` successfully and generated the store. `SimpleDirectoryReader`
stores the filename and file path as metadata.

**Model comparison (initial):** llama3.1:8B, deepseek-r1:8B, gemma3:1b.
Can't get past a fairly trivial query engine right now. These aren't very
powerful models. Need to keep testing and see what happens.