Initial commit: RAG pipeline for semantic search over personal journal archive
Vector search with cross-encoder re-ranking, hybrid BM25+vector retrieval, incremental index updates, and multiple LLM backends (Ollama local, OpenAI API).
commit e9fc99ddc6
43 changed files with 7349 additions and 0 deletions
31  .gitignore  vendored  Normal file
@@ -0,0 +1,31 @@
# Python
.venv/
__pycache__/
*.pyc

# HuggingFace cached models (large, ~2 GB)
models/

# Vector stores (large, rebuild with build_exp_claude.py)
storage_exp/
storage/

# Data (symlink to private journal files)
data

# IDE and OS
.DS_Store
.vscode/
.idea/

# Jupyter checkpoints
.ipynb_checkpoints/

# Secrets
.env

# Query log
query.log

# Duplicate of CLAUDE.md
claude.md
13  NOTES.md  Normal file
@@ -0,0 +1,13 @@

A simple query in ChatGPT produced this reference table:

| Metric | Best For | Type | Notes |
| -- | -- | -- | -- |
| Cosine Similarity | L2-normalized vectors | Similarity | Scale-invariant |
| Dot Product | Transformer embeddings | Similarity | Fast, especially on GPUs |
| Euclidean Distance | Raw vectors with meaningful norms | Distance | Sensitive to scale |
| Jaccard | Sparse binary or set-based data | Similarity | Discrete features |
| Soft Cosine | Sparse with semantic overlap | Similarity | Better for text-term overlap |
| Learned Similarity | Fine-tuned deep models | Varies | Best accuracy, slowest retrieval |
208  README.md  Normal file
@@ -0,0 +1,208 @@
# ssearch

Semantic search over a personal journal archive. Uses vector embeddings and a local LLM to find and synthesize information across 1800+ dated text entries spanning 2000-2025.

## How it works

```
Query → Embed (BAAI/bge-large-en-v1.5) → Vector similarity (top-30) → Cross-encoder re-rank (top-15) → LLM synthesis (command-r7b via Ollama, or OpenAI API) → Response + sources
```

1. **Build**: Journal entries in `./data` are chunked (256 tokens, 25-token overlap) and embedded into a vector store using LlamaIndex. Supports incremental updates (new/modified files only) or full rebuilds.
2. **Retrieve**: A user query is embedded with the same model and matched against stored vectors by cosine similarity, returning the top 30 candidate chunks.
3. **Re-rank**: A cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) scores each (query, chunk) pair jointly and keeps the top 15.
4. **Synthesize**: The re-ranked chunks are passed to a local LLM with a custom prompt that encourages multi-source synthesis, producing a grounded answer with file citations.

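Step 2's cosine-similarity retrieval reduces to a dot product over normalized vectors. A self-contained sketch with toy 3-dimensional vectors (the real pipeline uses 1024-dimensional bge embeddings via LlamaIndex; the chunk ids here are made up for illustration):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunks, k):
    # Rank every stored chunk vector against the query; keep the k best.
    ranked = sorted(chunks, key=lambda cid: cosine(query_vec, chunks[cid]), reverse=True)
    return ranked[:k]

chunks = {
    "2024-03-15.txt#0": [1.0, 0.0, 0.0],
    "2023-11-02.txt#2": [0.6, 0.8, 0.0],
    "2020-07-01.txt#1": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], chunks, k=2))
# → ['2024-03-15.txt#0', '2023-11-02.txt#2']
```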
## Project structure

```
ssearch/
├── build_exp_claude.py              # Build/update vector store (incremental by default)
├── query_topk_prompt_engine_v3.py   # Main query engine (cross-encoder re-ranking)
├── query_topk_prompt_engine_v2.py   # Previous query engine (no re-ranking)
├── retrieve_raw.py                  # Verbatim chunk retrieval (no LLM)
├── query_hybrid_bm25_v4.py          # Hybrid BM25 + vector query (v4)
├── retrieve_hybrid_raw.py           # Hybrid verbatim retrieval (no LLM)
├── search_keywords.py               # Keyword search via POS-based term extraction
├── run_query.sh                     # Shell wrapper with timing and logging
├── data/                            # Symlink to ../text/ (journal .txt files)
├── storage_exp/                     # Persisted vector store (~242 MB)
├── models/                          # Cached HuggingFace models (embedding + cross-encoder, offline)
├── archived/                        # Earlier iterations and prototypes
├── saved_output/                    # Saved query results and model comparisons
├── requirements.txt                 # Python dependencies (pip freeze)
├── NOTES.md                         # Similarity metric reference
├── devlog.txt                       # Development log and experimental findings
└── *.ipynb                          # Jupyter notebooks (HyDE, metrics, sandbox)
```

## Setup

**Prerequisites**: Python 3.12, [Ollama](https://ollama.com) with `command-r7b` pulled.

```bash
cd ssearch
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The `data/` symlink should point to `../text/` (the journal archive). The embedding model (`BAAI/bge-large-en-v1.5`) and cross-encoder (`cross-encoder/ms-marco-MiniLM-L-12-v2`) are cached in `./models/` for offline use.

### Offline model loading

All query scripts set three environment variables to prevent HuggingFace from making network requests:

```python
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
```

**These must appear before any imports that touch HuggingFace libraries.** The `huggingface_hub` library evaluates `HF_HUB_OFFLINE` once at import time (in `huggingface_hub/constants.py`). If the env var is set after imports, the library will still attempt network access and fail offline. This is a common pitfall -- `llama_index.embeddings.huggingface` transitively imports `huggingface_hub`, so even indirect imports trigger the evaluation.

Alternatively, set the variable in your shell before running Python:

```bash
export HF_HUB_OFFLINE=1
python query_hybrid_bm25_v4.py "your query"
```

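The import-time pitfall is ordinary Python behavior, not anything HuggingFace-specific: module-level code runs once, so a flag read at import never sees later changes to the environment. A stand-alone demonstration that needs no HuggingFace install (the module source here is a toy stand-in for `huggingface_hub/constants.py`):

```python
import os
import types

# A module whose top-level code reads an env var exactly once, the way
# huggingface_hub/constants.py reads HF_HUB_OFFLINE at import time.
source = "import os\nOFFLINE = os.environ.get('DEMO_OFFLINE') == '1'\n"

def import_demo_module():
    mod = types.ModuleType("demo_constants")
    exec(source, mod.__dict__)  # simulates importing the module
    return mod

os.environ.pop("DEMO_OFFLINE", None)
early = import_demo_module()        # "imported" before the env var is set
os.environ["DEMO_OFFLINE"] = "1"    # too late for `early`
late = import_demo_module()         # a fresh import does see it

print(early.OFFLINE, late.OFFLINE)
# → False True
```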
## Usage

### Build the vector store

```bash
# Incremental update (default): only processes new, modified, or deleted files
python build_exp_claude.py

# Full rebuild from scratch
python build_exp_claude.py --rebuild
```

The default incremental mode loads the existing index, compares file sizes and modification dates against the docstore, and only re-indexes what changed. A full rebuild (`--rebuild`) is only needed when chunk parameters or the embedding model change.

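The change-detection step amounts to a size-and-mtime diff against what the docstore recorded. A sketch of that logic (illustrative only -- the record schema and function name are assumptions, not `build_exp_claude.py`'s actual code):

```python
import os

def diff_against_docstore(data_dir, docstore):
    """Classify .txt files as new, modified, or deleted relative to the
    docstore's recorded (size, mtime) pairs."""
    on_disk = {
        name: os.stat(os.path.join(data_dir, name))
        for name in os.listdir(data_dir)
        if name.endswith(".txt")
    }
    new = sorted(n for n in on_disk if n not in docstore)
    modified = sorted(
        n for n, st in on_disk.items()
        if n in docstore
        and (st.st_size, st.st_mtime) != (docstore[n]["size"], docstore[n]["mtime"])
    )
    deleted = sorted(n for n in docstore if n not in on_disk)
    return new, modified, deleted
```

Only `new` and `modified` files would be re-chunked and re-embedded; `deleted` entries would be dropped from the index.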
### Search

Three categories of search are available, from heaviest (semantic + LLM) to lightest (grep).

#### Semantic search with LLM synthesis

These scripts embed the query, retrieve candidate chunks from the vector store, re-rank with a cross-encoder, and pass the top results to a local LLM that synthesizes a grounded answer with file citations. **Requires Ollama running with `command-r7b`.**

**Vector-only** (`query_topk_prompt_engine_v3.py`): Retrieves the top 30 chunks by cosine similarity, re-ranks to top 15, synthesizes.

```bash
python query_topk_prompt_engine_v3.py "What does the author say about creativity?"
```

**Hybrid BM25 + vector** (`query_hybrid_bm25_v4.py`): Retrieves the top 20 by vector similarity and the top 20 by BM25 term frequency, merges and deduplicates, re-ranks the union to top 15, synthesizes. Catches exact name/term matches that vector-only retrieval misses.

```bash
python query_hybrid_bm25_v4.py "Louis Menand"
```

**Interactive wrapper** (`run_query.sh`): Loops for queries using the v3 engine, displays timing, and appends queries to `query.log`.

```bash
./run_query.sh
```

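The merge-and-deduplicate step at the heart of the hybrid engine is just a keyed union over chunk ids. A minimal sketch (the tags match the annotations described for `retrieve_hybrid_raw.py` below, but the function itself is illustrative, not the project's code):

```python
def merge_candidates(vector_ids, bm25_ids):
    # Union of both candidate lists, deduplicated by chunk id and
    # tagged with which retriever(s) nominated each chunk.
    merged = {cid: "vector-only" for cid in vector_ids}
    for cid in bm25_ids:
        merged[cid] = "vector+bm25" if cid in merged else "bm25-only"
    return merged

print(merge_candidates(["c1", "c2"], ["c2", "c3"]))
# → {'c1': 'vector-only', 'c2': 'vector+bm25', 'c3': 'bm25-only'}
```

The resulting union (here three chunks) is what the cross-encoder then re-ranks down to the final top 15.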
#### Verbatim chunk retrieval (no LLM)

These scripts run the same retrieval and re-ranking pipeline but output the raw chunk text instead of passing it to an LLM. Useful for inspecting what the retrieval pipeline finds, or when Ollama is not available. **No Ollama needed.**

**Vector-only** (`retrieve_raw.py`): Top-30 vector retrieval, cross-encoder re-rank to top 15, raw output.

```bash
python retrieve_raw.py "Kondiaronk and the Wendats"
```

**Hybrid BM25 + vector** (`retrieve_hybrid_raw.py`): Same hybrid retrieval as v4 but outputs raw chunks. Each chunk is annotated with its source: `[vector-only]`, `[bm25-only]`, or `[vector+bm25]`.

```bash
python retrieve_hybrid_raw.py "Louis Menand"
```

Pipe either to `less` for browsing.

#### Keyword search (no vector store, no LLM)

**`search_keywords.py`**: Extracts nouns and adjectives from the query using NLTK POS tagging, then greps `./data/*.txt` for matches with surrounding context. A lightweight fallback when you want exact string matching without the vector store. **No vector store or Ollama needed.**

```bash
python search_keywords.py "Discussions of Kondiaronk and the Wendats"
```

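The term-extraction step can be approximated without NLTK. The sketch below uses a small stopword list as a crude stand-in for POS tagging (an assumption for illustration -- the real script keeps tokens that `nltk.pos_tag` labels as nouns or adjectives rather than filtering a word list):

```python
import re

# Crude stand-in for POS tagging: drop function words, keep content words.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "about"}

def extract_terms(query):
    tokens = re.findall(r"[A-Za-z]+", query)
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(extract_terms("Discussions of Kondiaronk and the Wendats"))
# → ['Discussions', 'Kondiaronk', 'Wendats']
```

Each surviving term would then be grepped against `./data/*.txt` with surrounding context.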
### Output format

```
Response:
<LLM-synthesized answer citing specific files>

Source documents:
2024-03-15.txt ./data/2024-03-15.txt 0.683
2023-11-02.txt ./data/2023-11-02.txt 0.651
...
```

## Configuration

Key parameters (set in source files):

| Parameter | Value | Location |
|-----------|-------|----------|
| Embedding model | `BAAI/bge-large-en-v1.5` | `build_exp_claude.py`, `query_topk_prompt_engine_v3.py` |
| Chunk size | 256 tokens | `build_exp_claude.py` |
| Chunk overlap | 25 tokens | `build_exp_claude.py` |
| Paragraph separator | `\n\n` | `build_exp_claude.py` |
| Initial retrieval | 30 chunks | `query_topk_prompt_engine_v3.py` |
| Re-rank model | `cross-encoder/ms-marco-MiniLM-L-12-v2` | `query_topk_prompt_engine_v3.py` |
| Re-rank top-n | 15 | `query_topk_prompt_engine_v3.py` |
| LLM | `command-r7b` (Ollama) or `gpt-4o-mini` (OpenAI API) | `query_topk_prompt_engine_v3.py`, `query_hybrid_bm25_v4.py` |
| Temperature | 0.3 (recommended for both local and API models) | `query_topk_prompt_engine_v3.py`, `query_hybrid_bm25_v4.py` |
| Context window | 8000 tokens | `query_topk_prompt_engine_v3.py` |
| Request timeout | 360 seconds | `query_topk_prompt_engine_v3.py` |

## Key dependencies

- **llama-index-core** (0.14.14) -- RAG framework
- **llama-index-embeddings-huggingface** (0.6.1) -- embedding integration
- **llama-index-llms-ollama** (0.9.1) -- local LLM via Ollama
- **llama-index-llms-openai** (0.6.18) -- OpenAI API LLM (optional, for API-based synthesis)
- **llama-index-readers-file** (0.5.6) -- file readers
- **llama-index-retrievers-bm25** (0.6.5) -- BM25 sparse retrieval for hybrid search
- **sentence-transformers** (5.1.0) -- embedding model support
- **torch** (2.8.0) -- ML runtime

## Notebooks

Three Jupyter notebooks document exploration and analysis:

- **`hyde.ipynb`** -- Experiments with HyDE (Hypothetical Document Embeddings) query rewriting. Tests whether generating a hypothetical answer to a query and embedding that instead improves retrieval. Uses LlamaIndex's `HyDEQueryTransform` with `llama3.1:8B`. Finding: the default HyDE prompt produced a rich hypothetical passage, but the technique did not improve retrieval quality over direct prompt engineering. This informed the decision to drop HyDE from the pipeline.

- **`sandbox.ipynb`** -- Exploratory notebook for learning the LlamaIndex API. Inspects the `llama_index.core` module (104 objects), lists available classes and methods, and reads the source of `VectorStoreIndex`. Useful as a quick reference for what LlamaIndex exposes.

- **`vs_metrics.ipynb`** -- Quantitative analysis of the vector store. Loads the persisted index (4,692 vectors, 1024 dimensions each from `BAAI/bge-large-en-v1.5`) and produces:
  - Distribution of embedding values (histogram)
  - Heatmap of the full embedding matrix
  - Embedding vector magnitude distribution
  - Per-dimension variance (which dimensions carry more signal)
  - Pairwise cosine similarity distribution and heatmap (subset)
  - Hierarchical clustering dendrogram (Ward linkage)
  - PCA and t-SNE 2D projections of the embedding space

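The pairwise-cosine analysis reduces to normalize-then-matmul. A toy-sized sketch (five random vectors standing in for the 4,692 x 1024 embedding matrix, not the notebook's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))  # stand-in for the stored embedding matrix

# L2-normalize rows; pairwise cosine similarity is then a single matmul.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sims = emb @ emb.T

print(sims.shape)  # (5, 5) symmetric matrix, diagonal all 1.0
```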
## Design decisions

- **BAAI/bge-large-en-v1.5 over all-mpnet-base-v2**: Better semantic matching quality for journal text despite slower embedding.
- **256-token chunks**: Tested 512 and 384; 256 with 25-token overlap produced the highest quality matches.
- **command-r7b over llama3.1:8B**: Sticks closer to provided context with less hallucination at comparable speed.
- **Top-k=15**: Wide enough to capture diverse perspectives, narrow enough to fit the context window.
- **Cross-encoder re-ranking (v3)**: Retrieve top-30 via bi-encoder, re-rank to top-15 with a cross-encoder that scores each (query, chunk) pair jointly. More accurate than bi-encoder similarity alone. Tested three models; `ms-marco-MiniLM-L-12-v2` selected over `stsb-roberta-base` (wrong task -- semantic similarity, not passage ranking) and `BAAI/bge-reranker-v2-m3` (50% slower, weak score tail).
- **HyDE query rewriting tested and dropped**: Did not improve results over direct prompt engineering.
- **V3 prompt**: Adapted for re-ranked context -- tells the LLM all excerpts have been curated, encourages examining every chunk and noting what each file contributes. Produces better multi-source synthesis than v2's prompt.
- **V2 prompt**: More flexible and query-adaptive than v1, which forced rigid structure (exactly 10 files, mandatory theme).
- **Verbatim retrieval (`retrieve_raw.py`)**: Uses LlamaIndex's `index.as_retriever()` instead of `index.as_query_engine()`. The retriever returns raw `NodeWithScore` objects (chunk text, metadata, scores) without invoking the LLM. The re-ranker is applied manually via `reranker.postprocess_nodes()`. This separation lets you inspect what the pipeline retrieves before synthesis.
- **Keyword search (`search_keywords.py`)**: NLTK POS tagging extracts nouns and adjectives from the query -- a middle ground between naive stopword removal and LLM-based term extraction. Catches exact names, places, and dates that vector similarity misses.
- **Hybrid BM25 + vector retrieval (v4)**: Runs two retrievers in parallel -- BM25 (top-20 by term frequency) and vector similarity (top-20 by cosine) -- merges and deduplicates candidates, then lets the cross-encoder re-rank the union to top-15. BM25 nominates candidates with exact term matches that embeddings miss; the cross-encoder decides final relevance. Uses `BM25Retriever.from_defaults(index=index)` from `llama-index-retrievers-bm25`, which indexes the nodes already stored in the persisted vector store.

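The two-stage retrieve-then-re-rank shape is easy to see in isolation. A pure-Python sketch with a token-overlap scorer standing in for the cross-encoder (the real pipeline scores each pair with `ms-marco-MiniLM-L-12-v2`; everything below is illustrative):

```python
def rerank(query, candidates, score_fn, top_n):
    # Stage 2: score each (query, chunk) pair jointly, keep the top_n.
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap(query, chunk):
    # Toy joint scorer: count of shared lowercase tokens.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "gardening notes from april",
    "louis menand essay on pragmatism",
    "menand review clipping",
]
print(rerank("louis menand", candidates, overlap, top_n=2))
# → ['louis menand essay on pragmatism', 'menand review clipping']
```

The point of the pattern: the first-stage retriever only has to be cheap and high-recall, because the jointly-scored second stage decides the final order.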
## Development history

- **Aug 2025**: Initial implementation -- build pipeline, embedding model comparison, chunk size experiments, HyDE testing, prompt v1.
- **Jan 2026**: Command-line interface, v2 prompt, error handling improvements, model comparison (command-r7b selected).
- **Feb 2026**: Project tidy-up, cross-encoder re-ranking (v3), v3 prompt for multi-source synthesis, cross-encoder model comparison (L-12 selected), archived superseded scripts. Hybrid BM25 + vector retrieval (v4). Upgraded LlamaIndex from 0.13.1 to 0.14.14; added OpenAI API as optional LLM backend (`llama-index-llms-openai`). Incremental vector store updates (default mode in `build_exp_claude.py`). Fixed offline HuggingFace model loading (env vars must precede imports).

See `devlog.txt` for detailed development notes and experimental findings.
51  archived/build.py  Normal file
@@ -0,0 +1,51 @@
# build.py
#
# Import documents from data, generate embedded vector store
# and save to disk in directory ./storage
#
# August 2025
# E. M. Furst

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
)

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter

def main():
    # Choose your embedding model
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

    # Configure global settings for LlamaIndex
    Settings.embed_model = embed_model

    # Load documents
    documents = SimpleDirectoryReader("./data").load_data()

    # Create the custom text splitter
    # Set chunk size and overlap (e.g., 256 tokens, 25 tokens overlap)
    # see https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/#llama_index.core.node_parser.SentenceSplitter
    text_splitter = SentenceSplitter(
        chunk_size=256,
        chunk_overlap=25,
        paragraph_separator="\n\n",  # use double newlines to separate paragraphs
    )
    Settings.text_splitter = text_splitter

    # Build the index
    index = VectorStoreIndex.from_documents(
        documents,
        transformations=[text_splitter],
        show_progress=True,
    )

    # Persist both vector store and index metadata
    index.storage_context.persist(persist_dir="./storage")

    print("Index built and saved to ./storage")

if __name__ == "__main__":
    main()
68  archived/build_exp.py  Normal file
@@ -0,0 +1,68 @@
# build_exp.py
#
# Import documents from data, generate embedded vector store
# and save to disk
#
# Experiment to include text chunking with a textsplitter
#
# August 2025
# E. M. Furst

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
)

from pathlib import Path
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter

def main():
    # Choose your embedding model
    # embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
    # embedding is slower with BAAI/bge-large-en-v1.5
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

    # Configure global settings for LlamaIndex
    Settings.embed_model = embed_model

    # Load documents (capabilities?)
    documents = SimpleDirectoryReader(
        "./data",
        # # p is a string path
        # file_metadata=lambda p: {
        #     "filename": Path(p).name,            # just the file name
        #     "filepath": str(Path(p).resolve()),  # absolute path (handy for tracing)
        # },
    ).load_data()

    # Create the custom text splitter
    # Set chunk size and overlap (here 256 tokens, 25 tokens overlap)
    # see https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_splitter/#llama_index.core.node_parser.SentenceSplitter
    text_splitter = SentenceSplitter(
        chunk_size=256,
        chunk_overlap=25,
        paragraph_separator="\n\n",  # use double newlines to separate paragraphs
    )
    # b/c passing text_splitter in the index build, this may cause problems
    # test with it commented out...
    # Settings.text_splitter = text_splitter

    # Build the index
    index = VectorStoreIndex.from_documents(
        documents,
        transformations=[text_splitter],
        show_progress=True,
    )

    # Persist both vector store and index metadata
    index.storage_context.persist(persist_dir="./storage_exp")

    # storage_context = StorageContext.from_defaults(vector_store=index.vector_store)
    # storage_context.persist(persist_dir="./storage")

    print("Index built and saved to ./storage_exp")

if __name__ == "__main__":
    main()
164  archived/claude_diagnostic.py  Normal file
@@ -0,0 +1,164 @@
# Better HyDE debugging with targeted tests

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core import PromptTemplate
from llama_index.core import Settings
from llama_index.core.base.base_query_engine import BaseQueryEngine
from llama_index.llms.ollama import Ollama

llm = "llama3.1:8B"

# Use a local model to generate
Settings.llm = Ollama(
    model=llm,  # First model tested
    request_timeout=360.0,
    context_window=8000,
    temperature=0.7,
)


# Test queries that should produce very different hypothetical documents
test_queries = [
    "What is the capital of France?",
    "How do you make chocolate chip cookies?",
    "Explain quantum physics",
    "Write a love letter",
    "Describe symptoms of the common cold",
]

print("=== DEBUGGING HYDE STEP BY STEP ===\n")

# 1. Test the LLM with HyDE-style prompts directly
print("1. Testing LLM directly with HyDE-style prompts:")
print("-" * 50)

for query in test_queries[:2]:  # Just test 2 to keep output manageable
    direct_prompt = f"""Generate a hypothetical document that would contain the answer to this query.

Query: {query}

Hypothetical document:"""

    response = Settings.llm.complete(direct_prompt)
    print(f"Query: {query}")
    print(f"Direct LLM Response: {response.text[:100]}...")
    print()

# 2. Check HyDE internals - let's see what's actually happening
print("\n2. Examining HyDE internal behavior:")
print("-" * 50)

# Create a custom HyDE that shows us everything
class VerboseHyDETransform(HyDEQueryTransform):
    def _get_prompts(self):
        """Show what prompts are being used"""
        prompts = super()._get_prompts()
        print(f"HyDE prompts: {prompts}")
        return prompts

    def _run_component(self, **kwargs):
        """Show what's being passed to the LLM"""
        print(f"HyDE _run_component kwargs: {kwargs}")
        result = super()._run_component(**kwargs)
        print(f"HyDE _run_component result: {result}")
        return result

# Test with verbose HyDE
verbose_hyde = VerboseHyDETransform(llm=Settings.llm)
test_result = verbose_hyde.run("What is machine learning?")
print(f"Final verbose result: {test_result}")

# 3. Try the most basic possible test
print("\n3. Most basic HyDE test:")
print("-" * 50)

basic_hyde = HyDEQueryTransform(llm=Settings.llm)
basic_result = basic_hyde.run("Paris")
print(f"Input: 'Paris'")
print(f"Output: '{basic_result}'")
print(f"Same as input? {basic_result.strip() == 'Paris'}")

# 4. Check if it's a version issue - try alternative approach
print("\n4. Alternative HyDE approach:")
print("-" * 50)

try:
    # Some versions might need different initialization
    from llama_index.core.query_engine import TransformQueryEngine
    from llama_index.core.indices.query.query_transform import HyDEQueryTransform

    # Try with explicit prompt template
    hyde_prompt_template = PromptTemplate(
        "Please write a passage to answer the question\n"
        "Try to include as many key details as possible\n"
        "\n"
        "\n"
        "Passage:{query_str}\n"
        "\n"
        "\n"
        "Passage:"
    )

    alt_hyde = HyDEQueryTransform(
        llm=Settings.llm,
        hyde_prompt=hyde_prompt_template,
    )

    alt_result = alt_hyde.run("What causes rain?")
    print(f"Alternative approach result: {alt_result}")

except Exception as e:
    print(f"Alternative approach failed: {e}")

# 5. Check what happens with different query formats
print("\n5. Testing different input formats:")
print("-" * 50)

from llama_index.core.schema import QueryBundle

# Test with QueryBundle vs string
hyde_test = HyDEQueryTransform(llm=Settings.llm)

string_result = hyde_test.run("test query")
print(f"String input result: '{string_result}'")

query_bundle = QueryBundle(query_str="test query")
bundle_result = hyde_test.run(query_bundle)
print(f"QueryBundle input result: '{bundle_result}'")

# 6. Version and import check
print("\n6. Environment check:")
print("-" * 50)
import llama_index
print(f"LlamaIndex version: {llama_index.__version__}")

# Check what LLM you're actually using
print(f"LLM type: {type(Settings.llm)}")
print(f"LLM model name: {getattr(Settings.llm, 'model', 'Unknown')}")

# 7. Try the nuclear option - completely manual implementation
print("\n7. Manual HyDE implementation:")
print("-" * 50)

def manual_hyde(query: str, llm):
    """Completely manual HyDE to see if the concept works"""
    prompt = f"""You are an expert writer. Generate a realistic document excerpt that would contain the answer to this question.

Question: {query}

Document excerpt:"""

    response = llm.complete(prompt)
    return response.text

manual_result = manual_hyde("What is photosynthesis?", Settings.llm)
print(f"Manual HyDE result: {manual_result[:150]}...")

# 8. Final diagnostic
print("\n8. Final diagnostic questions:")
print("-" * 50)
print("If all the above show the LLM generating proper responses but HyDE still returns original:")
print("- What LLM are you using? (OpenAI, Anthropic, local model, etc.)")
print("- What's your LlamaIndex version?")
print("- Are there any error messages in the logs?")
print("- Does the LLM have any special configuration or wrappers?")
BIN  archived/output.png  Normal file
Binary file not shown. (PNG image, 785 KiB)
110  archived/query.py  Normal file
@@ -0,0 +1,110 @@
# query_topk_prompt.py
# Run a query on a vector store
#
# E. M. Furst August 2025

from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
import os

#
# Globals
#
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Embedding model used in vector store (this should match the one in build.py or equivalent)
embed_model = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5")

# LLM model to use in query transform and generation
llm = "command-r7b"

#
# Custom prompt for the query engine
#
PROMPT = PromptTemplate(
    """You are an expert research assistant. You are given top-ranked writing excerpts (CONTEXT) and a user's QUERY.

Instructions:
- Base your response *only* on the CONTEXT.
- The snippets are ordered from most to least relevant—prioritize insights from earlier (higher-ranked) snippets.
- Aim to reference *as many distinct* relevant files as possible (up to 10).
- Do not invent or generalize; refer to specific passages or facts only.
- If a passage only loosely matches, deprioritize it.

Format your answer in two parts:

1. **Summary Theme**
Summarize the dominant theme from the relevant context in a few sentences.

2. **Matching Files**
Make a list of 10 matching files. The format for each should be:
<filename> -
<rationale tied to content. Include date or section hints if available.>

CONTEXT:
{context_str}

QUERY:
{query_str}

Now provide the theme and list of matching files."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate -- in this case using Ollama
    Settings.llm = Ollama(
        model=llm,  # First model tested
        request_timeout=360.0,
        context_window=8000,
    )

    # Load embedding model (same as used for vector store)
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Build regular query engine with custom prompt
    query_engine = index.as_query_engine(
        similarity_top_k=15,          # pull wide
        # response_mode="compact",    # concise synthesis
        text_qa_template=PROMPT,      # custom prompt
        # node_postprocessors=[
        #     SimilarityPostprocessor(similarity_cutoff=0.75)  # keep strong hits; makes result count flexible
        # ],
    )

    # Query loop
    while True:
        q = input("\nEnter a search topic or question (or 'exit'): ").strip()
        if q.lower() in ("exit", "quit"):
            break
        print()

        # Generate the response by querying the engine.
        # This performs the similarity search and then applies the prompt.
        response = query_engine.query(q)

        # Return the query response and source documents
        print(response.response)

        print("\nSource documents:")
        for node in response.source_nodes:
            meta = getattr(node, "metadata", None) or node.node.metadata
            print(f"{meta.get('file_name')} {meta.get('file_path')} {getattr(node, 'score', None)}")

if __name__ == "__main__":
    main()
90
archived/query_catalog.py
Normal file
@@ -0,0 +1,90 @@
# query_catalog.py
# Run a query on a vector store
# This version implements a CATALOG prompt
#
# E.M.F. July 2025
# August 2025 - updated search

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.prompts import PromptTemplate

import logging
logging.basicConfig(level=logging.DEBUG)


CATALOG_PROMPT = PromptTemplate(
    """You are a research assistant. You’re given journal snippets (CONTEXT) and a user query.
Your job is NOT to write an essay but to list the best-matching journal files with a 1–2 sentence rationale.

Rules:
- Use only the CONTEXT; do not invent content.
- Prefer precise references to passages over generalities.
- Output exactly:
1) A brief one-line summary of the overall theme you detect.
2) A bulleted list: **filename** — brief rationale. If available in the snippet, include date or section hints.

CONTEXT:
{context_str}

QUERY: {query_str}

Now produce the summary line and the bulleted list of matching files."""
)

# Use a local model to generate
Settings.llm = Ollama(
    # model="llama3.1:8B",      # First model tested
    # model="deepseek-r1:8B",   # This model shows its reasoning
    model="gemma3:1b",
    request_timeout=360.0,
    context_window=8000,
)


def main():
    # Load embedding model (same as used for the vector store)
    embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)

    query_engine = index.as_query_engine(
        similarity_top_k=10,                  # pull wide (tune to taste)
        # response_mode="compact",            # concise synthesis
        text_qa_template=CATALOG_PROMPT,      # custom prompt
        # node_postprocessors=[
        #     SimilarityPostprocessor(similarity_cutoff=0.75)  # keep strong hits; makes result count flexible
        # ],
    )

    # Query loop
    while True:
        q = input("\nEnter your question (or 'exit'): ").strip()
        if q.lower() in ("exit", "quit"):
            break
        print()
        response = query_engine.query(q)

        # Print the query response and source documents
        print(response.response)
        print("\nSource documents:")
        for sn in response.source_nodes:
            meta = getattr(sn, "metadata", None) or sn.node.metadata
            print(meta.get("file_name"), "---", meta.get("file_path"), getattr(sn, "score", None))


if __name__ == "__main__":
    main()
223
archived/query_claude_sonnet.py
Normal file
@@ -0,0 +1,223 @@
#!/usr/bin/env python3
"""
query_topk_prompt_engine.py

Query a vector store with a custom prompt for research assistance.
Uses BAAI/bge-large-en-v1.5 embeddings and Ollama for generation.

E.M.F. January 2026
Using Claude Sonnet 4.5 to suggest changes
"""

import argparse
import os
import sys
from pathlib import Path

from llama_index.core import (
    Settings,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core.prompts import PromptTemplate
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama


# Suppress tokenizer parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Configuration defaults
DEFAULT_LLM = "command-r7b"
DEFAULT_EMBED_MODEL = "BAAI/bge-large-en-v1.5"
DEFAULT_STORAGE_DIR = "./storage_exp"
DEFAULT_TOP_K = 15
DEFAULT_SIMILARITY_CUTOFF = 0.7  # Set to None to disable


def get_prompt_template(max_files: int = 10) -> PromptTemplate:
    """Return the custom prompt template for the query engine."""
    return PromptTemplate(
        f"""You are an expert research assistant. You are given top-ranked writing excerpts (CONTEXT) and a user's QUERY.

Instructions:
- Base your response *only* on the CONTEXT.
- The snippets are ordered from most to least relevant—prioritize insights from earlier (higher-ranked) snippets.
- Aim to reference *as many distinct* relevant files as possible (up to {max_files}).
- Do not invent or generalize; refer to specific passages or facts only.
- If a passage only loosely matches, deprioritize it.

Format your answer in two parts:

1. **Summary Theme**
Summarize the dominant theme from the relevant context in a few sentences.

2. **Matching Files**
List up to {max_files} matching files. Format each as:
<filename> - <rationale tied to content. Include date or section hints if available.>

CONTEXT:
{{context_str}}

QUERY:
{{query_str}}

Now provide the theme and list of matching files."""
    )


def load_models(
    llm_name: str = DEFAULT_LLM,
    embed_model_name: str = DEFAULT_EMBED_MODEL,
    cache_folder: str = "./models",
    request_timeout: float = 360.0,
    context_window: int = 8000,
):
    """Initialize and configure the LLM and embedding models."""
    Settings.llm = Ollama(
        model=llm_name,
        request_timeout=request_timeout,
        context_window=context_window,
    )
    Settings.embed_model = HuggingFaceEmbedding(
        cache_folder=cache_folder,
        model_name=embed_model_name,
        local_files_only=True,
    )


def load_query_engine(
    storage_dir: str = DEFAULT_STORAGE_DIR,
    top_k: int = DEFAULT_TOP_K,
    similarity_cutoff: float | None = DEFAULT_SIMILARITY_CUTOFF,
    max_files: int = 10,
):
    """Load the vector store and create a query engine with the custom prompt."""
    storage_path = Path(storage_dir)
    if not storage_path.exists():
        raise FileNotFoundError(f"Storage directory not found: {storage_dir}")

    storage_context = StorageContext.from_defaults(persist_dir=str(storage_path))
    index = load_index_from_storage(storage_context)

    # Build postprocessors
    postprocessors = []
    if similarity_cutoff is not None:
        postprocessors.append(SimilarityPostprocessor(similarity_cutoff=similarity_cutoff))

    return index.as_query_engine(
        similarity_top_k=top_k,
        text_qa_template=get_prompt_template(max_files),
        node_postprocessors=postprocessors if postprocessors else None,
    )


def get_node_metadata(node) -> dict:
    """Safely extract metadata from a source node."""
    # Handle different node structures in LlamaIndex
    if hasattr(node, "metadata") and node.metadata:
        return node.metadata
    if hasattr(node, "node") and hasattr(node.node, "metadata"):
        return node.node.metadata
    return {}


def print_results(response):
    """Print the query response and source documents."""
    print("\n" + "=" * 60)
    print("RESPONSE")
    print("=" * 60 + "\n")
    print(response.response)

    print("\n" + "=" * 60)
    print("SOURCE DOCUMENTS")
    print("=" * 60 + "\n")

    for i, node in enumerate(response.source_nodes, 1):
        meta = get_node_metadata(node)
        score = getattr(node, "score", None)
        file_name = meta.get("file_name", "Unknown")
        file_path = meta.get("file_path", "Unknown")
        score_str = f"{score:.3f}" if score is not None else "N/A"
        print(f"{i:2}. [{score_str}] {file_name}")
        print(f"    Path: {file_path}")


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="Query a vector store with a custom research assistant prompt.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python query_topk_prompt_engine.py "What themes appear in the documents?"
  python query_topk_prompt_engine.py --top-k 20 --llm llama3.1:8B "Find references to machine learning"
""",
    )
    parser.add_argument("query", nargs="+", help="The query text")
    parser.add_argument(
        "--llm",
        default=DEFAULT_LLM,
        help=f"Ollama model to use for generation (default: {DEFAULT_LLM})",
    )
    parser.add_argument(
        "--storage-dir",
        default=DEFAULT_STORAGE_DIR,
        help=f"Path to the vector store (default: {DEFAULT_STORAGE_DIR})",
    )
    parser.add_argument(
        "--top-k",
        type=int,
        default=DEFAULT_TOP_K,
        help=f"Number of similar documents to retrieve (default: {DEFAULT_TOP_K})",
    )
    parser.add_argument(
        "--similarity-cutoff",
        type=float,
        default=DEFAULT_SIMILARITY_CUTOFF,
        help=f"Minimum similarity score (default: {DEFAULT_SIMILARITY_CUTOFF}, use 0 to disable)",
    )
    parser.add_argument(
        "--max-files",
        type=int,
        default=10,
        help="Maximum files to list in response (default: 10)",
    )
    return parser.parse_args()


def main():
    args = parse_args()

    # Handle similarity cutoff of 0 as "disabled"
    similarity_cutoff = args.similarity_cutoff if args.similarity_cutoff > 0 else None

    try:
        print(f"Loading models (LLM: {args.llm})...")
        load_models(llm_name=args.llm)

        print(f"Loading index from {args.storage_dir}...")
        query_engine = load_query_engine(
            storage_dir=args.storage_dir,
            top_k=args.top_k,
            similarity_cutoff=similarity_cutoff,
            max_files=args.max_files,
        )

        query_text = " ".join(args.query)
        print(f"Querying: {query_text[:100]}{'...' if len(query_text) > 100 else ''}")

        response = query_engine.query(query_text)
        print_results(response)

    except FileNotFoundError as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error during query: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    main()
106
archived/query_exp.py
Normal file
@@ -0,0 +1,106 @@
# query_exp.py
# Run a query on a vector store
#
# This version implements a prompt and uses the build_exp.py vector store
# It is based on query_topk.py
# It uses 10 top-k results and a custom prompt
# The next version after this is query_rewrite.py
# build_exp.py modifies the chunk size and overlap from the original build.py
#
# E.M.F. August 2025

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate

# LLM model to use in query transform and generation
llm = "llama3.1:8B"
# Other models tried:
# llm = "deepseek-r1:8B"
# llm = "gemma3:1b"


# Custom prompt for the query engine
PROMPT = PromptTemplate(
    """You are an expert research assistant. You are given top-ranked journal excerpts (CONTEXT) and a user’s QUERY.

Instructions:
- Base your response *only* on the CONTEXT.
- The snippets are ordered from most to least relevant—prioritize insights from earlier (higher-ranked) snippets.
- Aim to reference *as many distinct* relevant files as possible (up to 10).
- Do not invent or generalize; refer to specific passages or facts only.
- If a passage only loosely matches, deprioritize it.

Format your answer in two parts:

1. **Summary Theme**
Summarize the dominant theme from the relevant context.

2. **Matching Files**
Make a bullet list of 10. The format for each should be:
**<filename>** — <rationale tied to content. Include date or section hints if available.>

CONTEXT:
{context_str}

QUERY:
{query_str}

Now provide the theme and list of matching files."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate
    Settings.llm = Ollama(
        model=llm,
        request_timeout=360.0,
        context_window=8000,
    )

    # Load embedding model (same as used for the vector store)
    embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Build a regular query engine with the custom prompt
    query_engine = index.as_query_engine(
        similarity_top_k=10,          # pull wide
        # response_mode="compact",    # concise synthesis
        text_qa_template=PROMPT,      # custom prompt
        # node_postprocessors=[
        #     SimilarityPostprocessor(similarity_cutoff=0.75)  # keep strong hits; makes result count flexible
        # ],
    )

    # Query loop
    while True:
        q = input("\nEnter your question (or 'exit'): ").strip()
        if q.lower() in ("exit", "quit"):
            break
        print()

        response = query_engine.query(q)

        # Print the query response and source documents
        print(response.response)
        print("\nSource documents:")
        for node in response.source_nodes:
            meta = getattr(node, "metadata", None) or node.node.metadata
            print(meta.get("file_name"), "---", meta.get("file_path"), getattr(node, "score", None))


if __name__ == "__main__":
    main()
106
archived/query_multitool.py
Normal file
@@ -0,0 +1,106 @@
"""
|
||||||
|
This is output generated by ChatG to implement a new regex + vector search engine
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
from typing import List, Iterable
|
||||||
|
import json, re
|
||||||
|
|
||||||
|
from llama_index.core import VectorStoreIndex, Settings
|
||||||
|
from llama_index.core.node_parser import SentenceSplitter
|
||||||
|
from llama_index.core.schema import NodeWithScore, QueryBundle
|
||||||
|
from llama_index.core.retrievers import BaseRetriever, EnsembleRetriever
|
||||||
|
from llama_index.core.query_engine import RetrieverQueryEngine
|
||||||
|
from llama_index.core import Document
|
||||||
|
|
||||||
|
# 0) Configure your LLM + embeddings up front
|
||||||
|
# Example: Settings.llm = <your Command-R wrapper> ; Settings.embed_model = <your embeddings>
|
||||||
|
# (You can also pass an llm explicitly into the retriever if you prefer.)
|
||||||
|
# Settings.llm.complete("hello") should work in v0.10+
|
||||||
|
|
||||||
|
# 1) Prepare nodes once (so regex + vector share the same chunks)
|
||||||
|
def build_nodes(docs: List[Document], chunk_size: int = 1024, overlap: int = 100):
|
||||||
|
splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
|
||||||
|
return splitter.get_nodes_from_documents(docs)
|
||||||
|
|
||||||
|
# 2) LLM-guided regex retriever
|
||||||
|
class RegexRetriever(BaseRetriever):
|
||||||
|
def __init__(self, nodes: Iterable, llm=None, top_k: int = 5, flags=re.IGNORECASE):
|
||||||
|
super().__init__()
|
||||||
|
self._nodes = list(nodes)
|
||||||
|
self._llm = llm or Settings.llm
|
||||||
|
self._top_k = top_k
|
||||||
|
self._flags = flags
|
||||||
|
|
||||||
|
def _extract_terms(self, query: str) -> List[str]:
|
||||||
|
"""Ask the LLM for up to ~6 distinctive keywords/short phrases. Return a list of strings."""
|
||||||
|
prompt = f"""
|
||||||
|
You extract search terms for a boolean/regex search.
|
||||||
|
Query: {query}
|
||||||
|
|
||||||
|
Rules:
|
||||||
|
- Return ONLY a JSON array of strings.
|
||||||
|
- Use up to 6 concise keywords/short phrases.
|
||||||
|
- Keep phrases short (<= 3 words).
|
||||||
|
- Avoid stopwords, punctuation, and generic terms.
|
||||||
|
- No explanations, no extra text.
|
||||||
|
"""
|
||||||
|
raw = self._llm.complete(prompt).text.strip()
|
||||||
|
try:
|
||||||
|
terms = json.loads(raw)
|
||||||
|
# basic sanitize
|
||||||
|
terms = [t for t in terms if isinstance(t, str) and t.strip()]
|
||||||
|
except Exception:
|
||||||
|
# simple fall-back if JSON parse fails
|
||||||
|
terms = [w for w in re.findall(r"\w+", query) if len(w) > 2][:6]
|
||||||
|
return terms[:6]
|
||||||
|
|
||||||
|
def _compile_patterns(self, terms: List[str]) -> List[re.Pattern]:
|
||||||
|
pats = []
|
||||||
|
for t in terms:
|
||||||
|
# Escape user/LLM output, add word boundaries; allow whitespace inside short phrases
|
||||||
|
escaped = re.escape(t)
|
||||||
|
# turn '\ ' (escaped space) back into '\s+' to match any whitespace in phrases
|
||||||
|
escaped = escaped.replace(r"\ ", r"\s+")
|
||||||
|
pats.append(re.compile(rf"\b{escaped}\b", self._flags))
|
||||||
|
return pats
|
||||||
|
|
||||||
|
def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
|
||||||
|
terms = self._extract_terms(query_bundle.query_str)
|
||||||
|
patterns = self._compile_patterns(terms)
|
||||||
|
|
||||||
|
scored: List[tuple] = []
|
||||||
|
for n in self._nodes:
|
||||||
|
txt = n.get_content(metadata_mode="all")
|
||||||
|
hits = 0
|
||||||
|
for p in patterns:
|
||||||
|
if p.search(txt):
|
||||||
|
hits += 1
|
||||||
|
if hits:
|
||||||
|
# simple score = number of distinct term hits (you can weight phrase vs single word if you like)
|
||||||
|
scored.append((n, float(hits)))
|
||||||
|
|
||||||
|
scored.sort(key=lambda x: x[1], reverse=True)
|
||||||
|
return [NodeWithScore(node=n, score=s) for n, s in scored[: self._top_k]]
|
||||||
|
|
||||||
|
# 3) Wire it all together
|
||||||
|
def build_query_engine(docs: List[Document], k_vec=5, k_regex=5, weights=(0.7, 0.3)):
|
||||||
|
nodes = build_nodes(docs)
|
||||||
|
# Vector index over the SAME nodes
|
||||||
|
vindex = VectorStoreIndex(nodes)
|
||||||
|
|
||||||
|
vector_ret = vindex.as_retriever(similarity_top_k=k_vec)
|
||||||
|
regex_ret = RegexRetriever(nodes, top_k=k_regex)
|
||||||
|
|
||||||
|
ensemble = EnsembleRetriever(
|
||||||
|
retrievers=[vector_ret, regex_ret],
|
||||||
|
weights=list(weights), # tune this: more recall from regex? bump weight on regex
|
||||||
|
# uses Reciprocal Rank Fusion by default
|
||||||
|
)
|
||||||
|
|
||||||
|
return RetrieverQueryEngine(retriever=ensemble)
|
||||||
|
|
||||||
|
# 4) Use it
|
||||||
|
# docs = SimpleDirectoryReader("data").load_data()
|
||||||
|
# qe = build_query_engine(docs)
|
||||||
|
# print(qe.query("Find entries with strong feelings of depression."))
|
||||||
126
archived/query_rewrite_hyde.py
Normal file
@@ -0,0 +1,126 @@
# query_rewrite_hyde.py
# Run a query on a vector store
#
# Latest experiment to include query rewriting using HyDE (Hypothetical Document Embeddings)
# The goal is to reduce the semantic gap between the query and the indexed documents
# This version implements a prompt and uses the build_exp.py vector store
# Based on query_exp.py
#
# E.M.F. July 2025

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine.transform_query_engine import TransformQueryEngine
import os

# Globals

# Embedding model used in the vector store (this should match the one in build_exp.py or equivalent)
# embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
embed_model = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# LLM model to use in query transform and generation
llm = "llama3.1:8B"
# Other models tried:
# llm = "deepseek-r1:8B"
# llm = "gemma3:1b"

# Custom prompt for the query engine
PROMPT = PromptTemplate(
    """You are an expert research assistant. You are given top-ranked writing excerpts (CONTEXT) and a user's QUERY.

Instructions:
- Base your response *only* on the CONTEXT.
- The snippets are ordered from most to least relevant—prioritize insights from earlier (higher-ranked) snippets.
- Aim to reference *as many distinct* relevant files as possible (up to 10).
- Do not invent or generalize; refer to specific passages or facts only.
- If a passage only loosely matches, deprioritize it.

Format your answer in two parts:

1. **Summary Theme**
Summarize the dominant theme from the relevant context in a few sentences.

2. **Matching Files**
Make a list of 10 matching files. The format for each should be:
<filename> — <rationale tied to content. Include date or section hints if available.>

CONTEXT:
{context_str}

QUERY:
{query_str}

Now provide the theme and list of matching files."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate
    Settings.llm = Ollama(
        model=llm,
        request_timeout=360.0,
        context_window=8000,
    )

    # Load embedding model (same as used for the vector store)
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Build a regular query engine with the custom prompt
    base_query_engine = index.as_query_engine(
        similarity_top_k=15,          # pull wide
        # response_mode="compact",    # concise synthesis
        text_qa_template=PROMPT,      # custom prompt
        # node_postprocessors=[
        #     SimilarityPostprocessor(similarity_cutoff=0.75)  # keep strong hits; makes result count flexible
        # ],
    )

    # HyDE is "Hypothetical Document Embeddings":
    # it generates a hypothetical document based on the query
    # and uses that to augment the query.
    # Here we include the original query as well;
    # similarity values are better with include_original=True.
    hyde_transform = HyDEQueryTransform(llm=Settings.llm, include_original=True)

    # Query loop
    while True:
        q = input("\nEnter a search topic or question (or 'exit'): ").strip()
        if q.lower() in ("exit", "quit"):
            break
        print()

        # The query uses a HyDE transformation to rewrite the query
        query_engine = TransformQueryEngine(base_query_engine, query_transform=hyde_transform)

        # Generate the response by querying the engine.
        # This performs the similarity search and then applies the prompt.
        response = query_engine.query(q)

        # Print the query response and source documents
        print(response.response)

        print("\nSource documents:")
        for node in response.source_nodes:
            meta = getattr(node, "metadata", None) or node.node.metadata
            print(meta.get("file_name"), "---", meta.get("file_path"), getattr(node, "score", None))


if __name__ == "__main__":
    main()
58
archived/query_topk.py
Normal file
@@ -0,0 +1,58 @@
# query_topk.py
# Run a query on a vector store
#
# E.M.F. July 2025
# August 2025 - updated search
# This version uses top-k similarity

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Use a local model to generate
Settings.llm = Ollama(
    model="llama3.1:8B",        # First model tested
    # model="deepseek-r1:8B",   # This model shows its reasoning
    # model="gemma3:1b",
    request_timeout=360.0,
    context_window=8000,
)


def main():
    # Load embedding model (same as used for the vector store)
    embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)

    query_engine = index.as_query_engine(similarity_top_k=5)

    # Query loop
    while True:
        q = input("\nEnter your question (or 'exit'): ").strip()
        if q.lower() in ("exit", "quit"):
            break
        print()
        response = query_engine.query(q)

        # Print the query response and source documents
        print(response.response)
        print("\nSource documents:")
        for node in response.source_nodes:
            meta = getattr(node, "metadata", None) or node.node.metadata
            print(meta.get("file_name"), "---", meta.get("file_path"), getattr(node, "score", None))


if __name__ == "__main__":
    main()
123
archived/query_topk_prompt.py
Normal file
@@ -0,0 +1,123 @@
# query_topk_prompt.py
# Run a query on a vector store
#
# This version is from query_rewrite_hyde.py, but removes HyDE and uses a custom prompt
# It implements a prompt and uses the build_exp.py vector store with BAAI/bge-large-en-v1.5
# Based on query_exp.py -> query_topk.py -> query_rewrite_hyde.py
# The results are as good as with HyDE.
#
# E.M.F. August 2025

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
import os

#
# Globals
#
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Embedding model used in the vector store (this should match the one in build_exp.py or equivalent)
# embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
embed_model = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5")

# LLM model to use in query transform and generation
# command-r7b generates about as quickly as llama3.1:8B, but provides results that stick better
# to the provided context
llm = "command-r7b"
# Other models tried:
# llm = "llama3.1:8B"
# llm = "deepseek-r1:8B"
# llm = "gemma3:1b"

#
# Custom prompt for the query engine
#
PROMPT = PromptTemplate(
    """You are an expert research assistant. You are given top-ranked writing excerpts (CONTEXT) and a user's QUERY.

Instructions:
- Base your response *only* on the CONTEXT.
- The snippets are ordered from most to least relevant—prioritize insights from earlier (higher-ranked) snippets.
- Aim to reference *as many distinct* relevant files as possible (up to 10).
- Do not invent or generalize; refer to specific passages or facts only.
- If a passage only loosely matches, deprioritize it.

Format your answer in two parts:

1. **Summary Theme**
Summarize the dominant theme from the relevant context in a few sentences.

2. **Matching Files**
Make a list of 10 matching files. The format for each should be:
<filename> -
<rationale tied to content. Include date or section hints if available.>

CONTEXT:
{context_str}

QUERY:
{query_str}

Now provide the theme and list of matching files."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate -- in this case using Ollama
    Settings.llm = Ollama(
        model=llm,
        request_timeout=360.0,
        context_window=8000,
    )

    # Load embedding model (same as used for the vector store)
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
|
||||||
|
index = load_index_from_storage(storage_context)
|
||||||
|
|
||||||
|
# Build regular query engine with custom prompt
|
||||||
|
query_engine = index.as_query_engine(
|
||||||
|
similarity_top_k=15, # pull wide
|
||||||
|
#response_mode="compact" # concise synthesis
|
||||||
|
text_qa_template=PROMPT, # custom prompt
|
||||||
|
# node_postprocessors=[
|
||||||
|
# SimilarityPostprocessor(similarity_cutoff=0.75) # keep strong hits; makes result count flexible
|
||||||
|
# ],
|
||||||
|
)
|
||||||
|
|
||||||
|
# Query
|
||||||
|
while True:
|
||||||
|
q = input("\nEnter a search topic or question (or 'exit'): ").strip()
|
||||||
|
if q.lower() in ("exit", "quit"):
|
||||||
|
break
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Generate the response by querying the engine
|
||||||
|
# This performes the similarity search and then applies the prompt
|
||||||
|
response = query_engine.query(q)
|
||||||
|
|
||||||
|
# Return the query response and source documents
|
||||||
|
print(response.response)
|
||||||
|
|
||||||
|
|
||||||
|
print("\nSource documents:")
|
||||||
|
for node in response.source_nodes:
|
||||||
|
meta = getattr(node, "metadata", None) or node.node.metadata
|
||||||
|
print(f"{meta.get('file_name')} {meta.get('file_path')} {getattr(node, 'score', None)}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
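The source-document printing in these scripts relies on a metadata fallback, `getattr(node, "metadata", None) or node.node.metadata`, which works whether the retrieved object exposes metadata directly or only through its wrapped inner node. A minimal standalone sketch of that pattern (dummy `SimpleNamespace` objects stand in for LlamaIndex node types; `node_metadata` is an illustrative helper, not a library call):

```python
from types import SimpleNamespace

def node_metadata(node):
    # Prefer metadata on the wrapper; fall back to the wrapped inner node.
    return getattr(node, "metadata", None) or node.node.metadata

inner = SimpleNamespace(metadata={"file_name": "2025-08-01.txt"})
wrapped = SimpleNamespace(node=inner, score=0.82)           # no .metadata on the wrapper
direct = SimpleNamespace(metadata={"file_name": "a.txt"}, node=None)

print(node_metadata(wrapped)["file_name"])  # → 2025-08-01.txt (falls back to inner node)
print(node_metadata(direct)["file_name"])   # → a.txt (wrapper metadata used directly)
```

The `or` fallback treats a missing attribute and an empty dict the same way, which is fine here because document metadata is never empty when present.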
134 archived/query_topk_prompt_dw.py Normal file
@@ -0,0 +1,134 @@
# query_topk_prompt_dw.py
# Run a query on a vector store
#
# This version is from query_rewrite_hyde.py, but removing HyDE and using a custom prompt
# This version implements a prompt and uses the build_exp.py vector store with BAAI/bge-large-en-v1.5
# Based on query_exp.py -> query_topk.py -> query_rewrite_hyde.py
# The results are as good as with HyDE.
# Modified for terminal output (132 columns)
#
# E.M.F. August 2025

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    ServiceContext,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
import os
import sys
import textwrap

# Print wrapping for terminal output
class Wrap80:
    def write(self, text):
        for line in text.splitlines():
            sys.__stdout__.write(textwrap.fill(line, width=131) + "\n")

    def flush(self):
        sys.__stdout__.flush()

sys.stdout = Wrap80()

#
# Globals
#

# Embedding model used in vector store (this should match the one in build_exp.py or equivalent)
# embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
embed_model = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# LLM model to use in query transform and generation
# command-r7b generates about as quickly as llama3.1:8B, but provides results that stick better
# to the provided context
llm = "command-r7b"
# Other models tried:
# llm = "llama3.1:8B"
# llm = "deepseek-r1:8B"
# llm = "gemma3:1b"

# Custom prompt for the query engine
PROMPT = PromptTemplate(
    """You are an expert research assistant. You are given top-ranked writing excerpts (CONTEXT) and a user's QUERY.

Instructions:
- Base your response *only* on the CONTEXT.
- The snippets are ordered from most to least relevant—prioritize insights from earlier (higher-ranked) snippets.
- Aim to reference *as many distinct* relevant files as possible (up to 10).
- Do not invent or generalize; refer to specific passages or facts only.
- If a passage only loosely matches, deprioritize it.

Format your answer in two parts:

1. **Summary Theme**
Summarize the dominant theme from the relevant context in a few sentences.

2. **Matching Files**
Make a list of 10 matching files. The format for each should be:
<filename> -
<rationale tied to content. Include date or section hints if available.>

CONTEXT:
{context_str}

QUERY:
{query_str}

Now provide the theme and list of matching files."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate
    Settings.llm = Ollama(
        model=llm,  # First model tested
        request_timeout=360.0,
        context_window=8000
    )

    # Load embedding model (same as used for vector store)
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Build regular query engine with custom prompt
    query_engine = index.as_query_engine(
        similarity_top_k=15,        # pull wide
        # response_mode="compact",  # concise synthesis
        text_qa_template=PROMPT,    # custom prompt
        # node_postprocessors=[
        #     SimilarityPostprocessor(similarity_cutoff=0.75)  # keep strong hits; makes result count flexible
        # ],
    )

    # Query
    while True:
        q = input("\nEnter a search topic or question (or 'exit'): ").strip()
        if q.lower() in ("exit", "quit"):
            break
        print()

        # Generate the response by querying the engine
        # This performs the similarity search and then applies the prompt
        response = query_engine.query(q)

        # Return the query response and source documents
        print(response.response)

        print("\nSource documents:")
        for node in response.source_nodes:
            meta = getattr(node, "metadata", None) or node.node.metadata
            print(f"{meta.get('file_name')} {meta.get('file_path')} {getattr(node, 'score', None)}", end="")


if __name__ == "__main__":
    main()
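The `Wrap80` shim in query_topk_prompt_dw.py redirects `sys.stdout` through `textwrap.fill` so every logical output line is re-flowed to the 131-column terminal width. Its core behavior can be exercised in isolation, without touching `sys.stdout` (`wrap_lines` is an illustrative helper mirroring `Wrap80.write`, not part of the script):

```python
import textwrap

def wrap_lines(text, width=131):
    # Wrap each logical line independently, as Wrap80.write() does.
    return [textwrap.fill(line, width=width) for line in text.splitlines()]

sample = "word " * 50            # one long logical line, ~250 characters
for chunk in wrap_lines(sample):
    print(chunk)                 # no physical line exceeds 131 columns
```

One caveat of the shim itself: `splitlines()` on text that ends without a newline still gets `"\n"` appended on write, which is why the dw script prints its source lines with `end=""`.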
123 archived/query_topk_prompt_engine.py Normal file
@@ -0,0 +1,123 @@
# query_topk_prompt_engine.py
# Run a query on a vector store
#
# This version is query_topk_prompt.py, but the query is passed through the command line.
#
# Implements a prompt and uses the build_exp.py vector store with BAAI/bge-large-en-v1.5
# Based on query_exp.py -> query_topk.py
#
# E.M.F. August 2025

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    ServiceContext,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
import os
import sys

#
# Globals
#
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Embedding model used in vector store (this should match the one in build_exp.py or equivalent)
# embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
embed_model = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)

# LLM model to use in query transform and generation
# command-r7b generates about as quickly as llama3.1:8B, but provides results that stick better
# to the provided context
llm = "command-r7b"
# Other models tried:
# llm = "llama3.1:8B"
# llm = "deepseek-r1:8B"
# llm = "gemma3:1b"

#
# Custom prompt for the query engine
#
PROMPT = PromptTemplate(
    """You are an expert research assistant. You are given top-ranked writing excerpts (CONTEXT) and a user's QUERY.

Instructions:
- Base your response *only* on the CONTEXT.
- The snippets are ordered from most to least relevant—prioritize insights from earlier (higher-ranked) snippets.
- Aim to reference *as many distinct* relevant files as possible (up to 10).
- Do not invent or generalize; refer to specific passages or facts only.
- If a passage only loosely matches, deprioritize it.

Format your answer in two parts:

1. **Summary Theme**
Summarize the dominant theme from the relevant context in a few sentences.

2. **Matching Files**
Make a list of 10 matching files. The format for each should be:
<filename> -
<rationale tied to content. Include date or section hints if available.>

CONTEXT:
{context_str}

QUERY:
{query_str}

Now provide the theme and list of matching files."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate -- in this case using Ollama
    Settings.llm = Ollama(
        model=llm,  # First model tested
        request_timeout=360.0,
        context_window=8000
    )

    # Load embedding model (same as used for vector store)
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Build regular query engine with custom prompt
    query_engine = index.as_query_engine(
        similarity_top_k=15,        # pull wide
        # response_mode="compact",  # concise synthesis
        text_qa_template=PROMPT,    # custom prompt
        # node_postprocessors=[
        #     SimilarityPostprocessor(similarity_cutoff=0.75)  # keep strong hits; makes result count flexible
        # ],
    )

    # Query
    if len(sys.argv) < 2:
        print("Usage: python query.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Generate the response by querying the engine
    # This performs the similarity search and then applies the prompt
    response = query_engine.query(q)

    # Return the query response and source documents
    print("\nResponse:\n")
    print(response.response)

    print("\nSource documents:")
    for node in response.source_nodes:
        meta = getattr(node, "metadata", None) or node.node.metadata
        print(f"{meta.get('file_name')} {meta.get('file_path')} {getattr(node, 'score', None):.3f}")


if __name__ == "__main__":
    main()
60 archived/query_tree.py Normal file
@@ -0,0 +1,60 @@
# query_tree.py
#
# Run a query on a vector store
# This is to test summarization using a tree-summarize response mode
# It doesn't work very well, perhaps because of the structure of the data
#
# E.M.F. August 2025


from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    ServiceContext,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Use a local model to generate
Settings.llm = Ollama(
    model="llama3.1:8B",       # First model tested
    # model="deepseek-r1:8B",  # This model shows its reasoning
    # model="gemma3:1b",
    request_timeout=360.0,
    context_window=8000
)

def main():
    # Load embedding model (same as used for vector store)
    embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)

    query_engine = index.as_query_engine(response_mode="tree_summarize")

    # Query
    while True:
        q = input("\nEnter your question (or 'exit'): ").strip()
        if q.lower() in ("exit", "quit"):
            break
        print()
        response = query_engine.query(q)

        # Return the query response and source documents
        print(response.response)
        print("\nSource documents:")
        for node in response.source_nodes:
            meta = getattr(node, "metadata", None) or node.node.metadata
            print(meta.get("file_name"), "---", meta.get("file_path"), getattr(node, "score", None))


if __name__ == "__main__":
    main()
27 archived/vs_metrics.py Normal file
@@ -0,0 +1,27 @@
# vs_metrics.py
# Quantify vector store properties and performance
#
# E.M.F. August 2025

# Read in vector store

# What are properties of the vector store?
# - number of vectors
# - distribution of distances
# - clustering?

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    ServiceContext,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load embedding model (same as used for vector store)
embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
Settings.embed_model = embed_model

# Load persisted vector store + metadata
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
190 build_exp_claude.py Normal file
@@ -0,0 +1,190 @@
# build_exp_claude.py
#
# Build or update the vector store from journal entries in ./data.
#
# Default mode (incremental): loads the existing index and adds only
# new or modified files. Use --rebuild for a full rebuild from scratch.
#
# January 2026
# E. M. Furst
# Used Sonnet 4.5 to suggest changes; Opus 4.6 for incremental update

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from pathlib import Path
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
import argparse
import datetime
import os
import time

# Shared constants
DATA_DIR = Path("./data")
PERSIST_DIR = "./storage_exp"
EMBED_MODEL_NAME = "BAAI/bge-large-en-v1.5"
CHUNK_SIZE = 256
CHUNK_OVERLAP = 25


def get_text_splitter():
    return SentenceSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        paragraph_separator="\n\n",
    )


def rebuild():
    """Full rebuild: delete and recreate the vector store from scratch."""
    if not DATA_DIR.exists():
        raise FileNotFoundError(f"Data directory not found: {DATA_DIR.absolute()}")

    print(f"Loading documents from {DATA_DIR.absolute()}...")
    documents = SimpleDirectoryReader(str(DATA_DIR)).load_data()

    if not documents:
        raise ValueError("No documents found in data directory")

    print(f"Loaded {len(documents)} document(s)")

    print("Building vector index...")
    index = VectorStoreIndex.from_documents(
        documents,
        transformations=[get_text_splitter()],
        show_progress=True,
    )

    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"Index built and saved to {PERSIST_DIR}")


def update():
    """Incremental update: add new files, re-index modified files, remove deleted files."""
    if not DATA_DIR.exists():
        raise FileNotFoundError(f"Data directory not found: {DATA_DIR.absolute()}")

    # Load existing index
    print(f"Loading existing index from {PERSIST_DIR}...")
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

    # Set transformations so index.insert() chunks correctly
    Settings.transformations = [get_text_splitter()]

    # Build lookup of indexed files: file_name -> (ref_doc_id, metadata)
    all_ref_docs = index.docstore.get_all_ref_doc_info()
    indexed = {}
    for ref_id, info in all_ref_docs.items():
        fname = info.metadata.get("file_name")
        if fname:
            indexed[fname] = (ref_id, info.metadata)

    print(f"Index contains {len(indexed)} documents")

    # Scan current files on disk
    disk_files = {f.name: f for f in sorted(DATA_DIR.glob("*.txt"))}
    print(f"Data directory contains {len(disk_files)} files")

    # Classify files
    new_files = []
    modified_files = []
    deleted_files = []
    unchanged = 0

    for fname, fpath in disk_files.items():
        if fname not in indexed:
            new_files.append(fpath)
        else:
            ref_id, meta = indexed[fname]
            # Compare file size and modification date
            stat = fpath.stat()
            disk_size = stat.st_size
            # Must use UTC to match SimpleDirectoryReader's date format
            disk_mdate = datetime.datetime.fromtimestamp(
                stat.st_mtime, tz=datetime.timezone.utc
            ).strftime("%Y-%m-%d")

            stored_size = meta.get("file_size")
            stored_mdate = meta.get("last_modified_date")

            if disk_size != stored_size or disk_mdate != stored_mdate:
                modified_files.append((fpath, ref_id))
            else:
                unchanged += 1

    for fname, (ref_id, meta) in indexed.items():
        if fname not in disk_files:
            deleted_files.append((fname, ref_id))

    # Report
    print(f"\n  New:       {len(new_files)}")
    print(f"  Modified:  {len(modified_files)}")
    print(f"  Deleted:   {len(deleted_files)}")
    print(f"  Unchanged: {unchanged}")

    if not new_files and not modified_files and not deleted_files:
        print("\nNothing to do.")
        return

    # Process deletions (including modified files that need re-indexing)
    for fname, ref_id in deleted_files:
        print(f"  Removing {fname}")
        index.delete_ref_doc(ref_id, delete_from_docstore=True)

    for fpath, ref_id in modified_files:
        print(f"  Re-indexing {fpath.name} (modified)")
        index.delete_ref_doc(ref_id, delete_from_docstore=True)

    # Process additions (new files + modified files)
    files_to_add = new_files + [fpath for fpath, _ in modified_files]
    if files_to_add:
        print(f"\nIndexing {len(files_to_add)} file(s)...")
        docs = SimpleDirectoryReader(input_files=[str(f) for f in files_to_add]).load_data()
        for doc in docs:
            index.insert(doc)

    # Persist
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"\nIndex updated and saved to {PERSIST_DIR}")


def main():
    parser = argparse.ArgumentParser(
        description="Build or update the vector store from journal entries."
    )
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Full rebuild from scratch (default: incremental update)",
    )
    args = parser.parse_args()

    # Configure embedding model
    embed_model = HuggingFaceEmbedding(model_name=EMBED_MODEL_NAME)
    Settings.embed_model = embed_model

    start = time.time()

    if args.rebuild:
        print("Mode: full rebuild")
        rebuild()
    else:
        print("Mode: incremental update")
        if not Path(PERSIST_DIR).exists():
            print(f"No existing index at {PERSIST_DIR}, doing full rebuild.")
            rebuild()
        else:
            update()

    elapsed = time.time() - start
    print(f"Done in {elapsed:.1f}s")


if __name__ == "__main__":
    main()
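build_exp_claude.py classifies each journal file as new, modified, deleted, or unchanged by comparing on-disk size and UTC modification date against the metadata stored in the index. That decision logic can be sketched independently of LlamaIndex, with plain dicts standing in for the docstore metadata (`classify` is an illustrative helper, not part of the script):

```python
def classify(disk, indexed):
    """disk / indexed: {file_name: (size, mdate)}.
    Returns (new, modified, deleted, unchanged) lists of file names."""
    new, modified, unchanged = [], [], []
    for name, sig in disk.items():
        if name not in indexed:
            new.append(name)            # on disk, not yet indexed
        elif indexed[name] != sig:
            modified.append(name)       # size or mdate changed -> re-index
        else:
            unchanged.append(name)
    deleted = [name for name in indexed if name not in disk]
    return new, modified, deleted, unchanged

disk = {"a.txt": (100, "2026-01-02"), "b.txt": (50, "2026-01-01"), "c.txt": (10, "2026-01-01")}
indexed = {"b.txt": (50, "2026-01-01"), "c.txt": (12, "2026-01-01"), "d.txt": (5, "2025-12-31")}
print(classify(disk, indexed))
# → (['a.txt'], ['c.txt'], ['d.txt'], ['b.txt'])
```

Note the same trade-off as the script: a date-only comparison misses same-day edits that leave the file size unchanged, which is why both size and mdate are checked together.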
1189 devlog.txt Normal file
File diff suppressed because it is too large
249 hyde.ipynb Normal file
@@ -0,0 +1,249 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "11d5ae50",
   "metadata": {},
   "source": [
    "# Experimenting with HyDE\n",
    "\n",
    "Using this to explore query rewrites\\\n",
    "August 2025"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "813f8b1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import (\n",
    "    StorageContext,\n",
    "    load_index_from_storage,\n",
    "    ServiceContext,\n",
    "    Settings,\n",
    ")\n",
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "from llama_index.llms.ollama import Ollama\n",
    "from llama_index.core.prompts import PromptTemplate\n",
    "from llama_index.core.indices.query.query_transform import HyDEQueryTransform\n",
    "from llama_index.core.query_engine.transform_query_engine import TransformQueryEngine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "f3d65589",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm=\"llama3.1:8B\"\n",
    "\n",
    "# Use a local model to generate\n",
    "Settings.llm = Ollama(\n",
    "    model=llm,  # First model tested\n",
    "    request_timeout=360.0,\n",
    "    context_window=8000,\n",
    "    temperature=0.7,\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "afd593ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load embedding model (same as used for vector store)\n",
    "embed_model = HuggingFaceEmbedding(model_name=\"all-mpnet-base-v2\")\n",
    "Settings.embed_model = embed_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "04c702a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original query: Find entries with strong feelings of depression.\n",
      "HyDE-generated query (used for embedding):\n",
      " Find entries with strong feelings of depression.\n"
     ]
    }
   ],
   "source": [
    "# Initial query\n",
    "initial_query = \"Find entries with strong feelings of depression.\"\n",
    "\n",
    "# Define a custom HyDE prompt (this is fully supported)\n",
    "hyde_prompt = PromptTemplate(\n",
    "    \"You are a helpful assistant. Generate a detailed hypothetical answer to the user query below.\\n\\nQuery: {query_str}\\n\\nAnswer:\"\n",
    ")\n",
    "\n",
    "hyde_transform = HyDEQueryTransform(llm=Settings.llm, hyde_prompt=hyde_prompt, include_original=False)\n",
    "\n",
    "# Run the transform manually\n",
    "hyde_query = hyde_transform.run(initial_query)\n",
    "\n",
    "# Print the result\n",
    "print(\"Original query:\", initial_query)\n",
    "print(\"HyDE-generated query (used for embedding):\\n\", hyde_query.query_str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "3b211daf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are many important feelings that people experience in their lives. Here are some examples:\n",
      "\n",
      "1. **Love**: A strong affection or attachment to someone, which can be romantic, familial, or platonic.\n",
      "2. **Happiness**: A positive emotional state characterized by a sense of joy, contentment, and satisfaction.\n",
      "3. **Empathy**: The ability to understand and share the feelings of others, which is essential for building strong relationships and fostering compassion.\n",
      "4. **Gratitude**: Feeling thankful or appreciative for something or someone in one's life, which can cultivate a positive outlook and well-being.\n",
      "5. **Compassion**: A feeling of concern and kindness towards others who are suffering or struggling, which can inspire acts of service and support.\n",
      "6. **Confidence**: A sense of self-assurance and faith in one's abilities, which is essential for personal growth and achievement.\n",
      "7. **Respect**: Feeling admiration or esteem for someone or something, which is necessary for building strong relationships and social bonds.\n",
      "8. **Forgiveness**: The ability to let go of negative emotions and forgive oneself or others for past mistakes or hurtful actions.\n",
      "9. **Excitement**: A feeling of enthusiasm and eagerness, often accompanied by a sense of anticipation or adventure.\n",
      "10. **Serenity**: A state of calmness and peace, which can be cultivated through mindfulness and self-reflection.\n",
      "\n",
      "These feelings are essential for human well-being and relationships, and they play important roles in shaping our experiences and interactions with others.\n",
      "\n",
      "Would you like me to expand on any of these feelings or explore other emotions?\n"
     ]
    }
   ],
   "source": [
    "# Check that the LLM is working\n",
    "# confirmed that this generates different responses each time\n",
    "response = Settings.llm.complete(\"What are several important feelings?\")\n",
    "print(response.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "9db5c9c2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HyDE output:\n",
      " Find entries with strong feelings of depression.\n"
     ]
    }
   ],
   "source": [
    "# Test for silent errors. Output verifies working.\n",
    "try:\n",
    "    hyde_result = hyde_transform.run(initial_query)\n",
    "    print(\"HyDE output:\\n\", hyde_result)\n",
    "except Exception as e:\n",
    "    print(\"LLM error:\", e)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5add1ed",
   "metadata": {},
   "source": [
    "## Testing HyDE based on llamaindex documentation\n",
    "\n",
    "https://docs.llamaindex.ai/en/stable/examples/query_transformations/HyDEQueryTransformDemo/#querying-without-transformation-yields-reasonable-answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "90381bc2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
"[\"Here is a passage that includes several key details about depression:\\n\\n**The Descent into Darkness**\\n\\nAs I lay in bed, staring blankly at the ceiling, I felt an overwhelming sense of hopelessness wash over me. The darkness seemed to close in around me, suffocating me with its crushing weight. Every thought felt like a burden, every decision a chore. I couldn't bear the idea of getting out of bed, of facing another day filled with anxiety and despair.\\n\\nI had been struggling with depression for what felt like an eternity. The symptoms had started slowly, a nagging feeling that something was off, but I had tried to brush it aside as mere exhaustion or stress. But as time went on, the feelings intensified, until they became all-consuming. I felt like I was drowning in a sea of sadness, unable to find a lifeline.\\n\\nThe smallest things would set me off - a harsh word from a loved one, a missed deadline at work, even just getting out of bed and facing another day. The world seemed too much for me to handle, and I retreated into my own private hell of despair. I couldn't eat, couldn't sleep, couldn't find any joy in the things that used to bring me happiness.\\n\\nAs I looked back on the past few months, I realized that this wasn't just a passing phase or a normal response to stress. Depression had taken hold, and it was suffocating me. I knew I needed help, but the thought of seeking treatment seemed daunting, even terrifying. What if they couldn't help me? What if I was stuck in this pit forever?\\n\\nI felt like I was losing myself, bit by bit, as depression consumed me. I longed for a glimmer of hope, a spark of light to guide me through the darkness. 
But it seemed elusive, always just out of reach.\\n\\nThis passage includes several key details about depression, including:\\n\\n* **Overwhelming feelings of sadness and hopelessness**: The protagonist feels an intense sense of despair that is difficult to shake.\\n* **Loss of motivation**: They feel like they can't get out of bed or face another day filled with anxiety and despair.\\n* **Withdrawal from activities**: They have lost interest in things that used to bring them joy, and are unable to eat or sleep.\\n* **Social isolation**: They retreat into their own private hell, feeling disconnected from others.\\n* **Loss of identity**: They feel like they are losing themselves as depression consumes them.\\n* **Fear of seeking help**: The protagonist is afraid to seek treatment, fearing that it won't work or that they will be stuck in this state forever.\",\n",
|
||||||
|
" 'Find entries with strong feelings of depression.']"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 54,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"hyde = HyDEQueryTransform(llm=Settings.llm,include_original=True)\n",
|
||||||
|
"query_str = \"Find entries with strong feelings of depression.\"\n",
|
||||||
|
"query_bundle = hyde(query_str)\n",
|
||||||
|
"hyde_doc = query_bundle.embedding_strs\n",
|
||||||
|
"hyde_doc"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 55,
|
||||||
|
"id": "08e7eca4",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/markdown": [
|
||||||
|
"[\"Here is a passage that includes several key details about depression:\\n\\n**The Descent into Darkness**\\n\\nAs I lay in bed, staring blankly at the ceiling, I felt an overwhelming sense of hopelessness wash over me. The darkness seemed to close in around me, suffocating me with its crushing weight. Every thought felt like a burden, every decision a chore. I couldn't bear the idea of getting out of bed, of facing another day filled with anxiety and despair.\\n\\nI had been struggling with depression for what felt like an eternity. The symptoms had started slowly, a nagging feeling that something was off, but I had tried to brush it aside as mere exhaustion or stress. But as time went on, the feelings intensified, until they became all-consuming. I felt like I was drowning in a sea of sadness, unable to find a lifeline.\\n\\nThe smallest things would set me off - a harsh word from a loved one, a missed deadline at work, even just getting out of bed and facing another day. The world seemed too much for me to handle, and I retreated into my own private hell of despair. I couldn't eat, couldn't sleep, couldn't find any joy in the things that used to bring me happiness.\\n\\nAs I looked back on the past few months, I realized that this wasn't just a passing phase or a normal response to stress. Depression had taken hold, and it was suffocating me. I knew I needed help, but the thought of seeking treatment seemed daunting, even terrifying. What if they couldn't help me? What if I was stuck in this pit forever?\\n\\nI felt like I was losing myself, bit by bit, as depression consumed me. I longed for a glimmer of hope, a spark of light to guide me through the darkness. 
But it seemed elusive, always just out of reach.\\n\\nThis passage includes several key details about depression, including:\\n\\n* **Overwhelming feelings of sadness and hopelessness**: The protagonist feels an intense sense of despair that is difficult to shake.\\n* **Loss of motivation**: They feel like they can't get out of bed or face another day filled with anxiety and despair.\\n* **Withdrawal from activities**: They have lost interest in things that used to bring them joy, and are unable to eat or sleep.\\n* **Social isolation**: They retreat into their own private hell, feeling disconnected from others.\\n* **Loss of identity**: They feel like they are losing themselves as depression consumes them.\\n* **Fear of seeking help**: The protagonist is afraid to seek treatment, fearing that it won't work or that they will be stuck in this state forever.\", 'Find entries with strong feelings of depression.']"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"<IPython.core.display.Markdown object>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"from IPython.display import Markdown, display\n",
|
||||||
|
"display(Markdown(f\"{hyde_doc}\"))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "9ca50f9d",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": []
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": ".venv",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.12.11"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
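The HyDE cells above follow a simple pattern: the LLM writes a hypothetical answer passage for the query, and both that passage and (optionally) the original query become the strings to embed. A minimal dependency-free sketch of that flow, where `generate` is a hypothetical stand-in for the LLM call rather than LlamaIndex's API:

```python
# Sketch of the HyDE idea: generate a hypothetical document for the query,
# then embed both strings (mirrors query_bundle.embedding_strs above).
# `generate` is a stand-in for the LLM call, not a real LlamaIndex function.

def hyde_transform(query, generate, include_original=True):
    hypothetical = generate(query)  # LLM writes a plausible answer passage
    return [hypothetical] + ([query] if include_original else [])

# Toy "LLM" that just expands the query into a passage-like string
fake_llm = lambda q: f"Here is a passage about: {q}"

strs = hyde_transform("strong feelings of depression", fake_llm)
print(strs)
```

With `include_original=True` the original query is retained as a second embedding string, which is exactly the two-element list the notebook cell displays.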
176
query_hybrid_bm25_v4.py
Normal file
@@ -0,0 +1,176 @@
# query_hybrid_bm25_v4.py
# Hybrid retrieval: BM25 (sparse) + vector similarity (dense) + cross-encoder
#
# Combines two retrieval strategies to catch both exact term matches and
# semantic similarity:
# 1. Retrieve top-20 via vector similarity (bi-encoder, catches meaning)
# 2. Retrieve top-20 via BM25 (term frequency, catches exact names/dates)
# 3. Merge and deduplicate candidates by node ID
# 4. Re-rank the union with a cross-encoder -> top-15
# 5. Pass re-ranked chunks to LLM for synthesis
#
# The cross-encoder doesn't care where candidates came from -- it scores
# each (query, chunk) pair on its own merits. BM25's job is just to
# nominate candidates that vector similarity might miss.
#
# E.M.F. February 2026

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
    get_response_synthesizer,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.retrievers.bm25 import BM25Retriever
import sys

#
# Globals
#

# Embedding model (must match build_exp_claude.py)
EMBED_MODEL = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)

# LLM model for generation
LLM_MODEL = "command-r7b"

# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15

# Retrieval parameters
VECTOR_TOP_K = 20  # candidates from vector similarity
BM25_TOP_K = 20    # candidates from BM25 term matching

#
# Custom prompt -- same as v3
#
PROMPT = PromptTemplate(
    """You are a precise research assistant analyzing excerpts from a personal journal collection.
Every excerpt below has been selected and ranked for relevance to the query.

CONTEXT (ranked by relevance):
{context_str}

QUERY:
{query_str}

Instructions:
- Answer ONLY using information explicitly present in the CONTEXT above
- Examine ALL provided excerpts, not just the top few -- each one was selected for relevance
- Be specific: quote or closely paraphrase key passages and cite their file names
- When multiple files touch on the query, note what each one contributes
- If the context doesn't contain enough information to answer fully, say so

Your response should:
1. Directly answer the query, drawing on as many relevant excerpts as possible
2. Reference specific files and their content (e.g., "In <filename>, ...")
3. End with a list of all files that contributed to your answer, with a brief note on each

If the context is insufficient, explain what's missing."""
)


def main():
    # Configure LLM and embedding model
    # for local model using ollama
    # Note: Ollama temperature defaults to 0.8
    Settings.llm = Ollama(
        model=LLM_MODEL,
        temperature=0.3,
        request_timeout=360.0,
        context_window=8000,
    )

    # Use OpenAI API:
    # from llama_index.llms.openai import OpenAI
    # Settings.llm = OpenAI(
    #     model="gpt-4o-mini",  # or "gpt-4o" for higher quality
    #     temperature=0.3,
    # )

    Settings.embed_model = EMBED_MODEL

    # Load persisted vector store
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # --- Retrievers ---

    # Vector retriever (dense: cosine similarity over embeddings)
    vector_retriever = index.as_retriever(similarity_top_k=VECTOR_TOP_K)

    # BM25 retriever (sparse: term frequency scoring)
    bm25_retriever = BM25Retriever.from_defaults(
        index=index,
        similarity_top_k=BM25_TOP_K,
    )

    # Cross-encoder re-ranker
    reranker = SentenceTransformerRerank(
        model=RERANK_MODEL,
        top_n=RERANK_TOP_N,
    )

    # --- Query ---

    if len(sys.argv) < 2:
        print("Usage: python query_hybrid_bm25_v4.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Retrieve from both sources
    vector_nodes = vector_retriever.retrieve(q)
    bm25_nodes = bm25_retriever.retrieve(q)

    # Merge and deduplicate by node ID
    seen_ids = set()
    merged = []
    for node in vector_nodes + bm25_nodes:
        node_id = node.node.node_id
        if node_id not in seen_ids:
            seen_ids.add(node_id)
            merged.append(node)

    # Re-rank the merged candidates with cross-encoder
    reranked = reranker.postprocess_nodes(merged, query_str=q)

    # Report retrieval stats
    bm25_ids = {b.node.node_id for b in bm25_nodes}
    vector_ids = {v.node.node_id for v in vector_nodes}
    n_vector_only = len([n for n in vector_nodes if n.node.node_id not in bm25_ids])
    n_bm25_only = len([n for n in bm25_nodes if n.node.node_id not in vector_ids])
    n_both = len(vector_nodes) + len(bm25_nodes) - len(merged)

    print(f"\nQuery: {q}")
    print(f"Vector: {len(vector_nodes)}, BM25: {len(bm25_nodes)}, "
          f"overlap: {n_both}, merged: {len(merged)}, re-ranked to: {len(reranked)}")

    # Synthesize response with LLM
    synthesizer = get_response_synthesizer(text_qa_template=PROMPT)
    response = synthesizer.synthesize(q, nodes=reranked)

    # Output
    print("\nResponse:\n")
    print(response.response)

    print("\nSource documents:")
    for node in response.source_nodes:
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        score_str = f"{score:.3f}" if score is not None else "n/a"
        print(f"{meta.get('file_name')} {meta.get('file_path')} {score_str}")


if __name__ == "__main__":
    main()
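The merge step in the script above is an order-preserving deduplication by node ID, and the reported overlap count follows directly from the list sizes: overlap = |vector| + |bm25| - |merged|. A self-contained sketch of that logic with plain string IDs instead of LlamaIndex node objects:

```python
# Order-preserving merge of two candidate lists, deduplicated by ID,
# as in query_hybrid_bm25_v4.py but with bare strings instead of nodes.

def merge_candidates(vector_ids, bm25_ids):
    seen, merged = set(), []
    for nid in vector_ids + bm25_ids:  # vector hits keep their rank priority
        if nid not in seen:
            seen.add(nid)
            merged.append(nid)
    return merged

vec = ["a", "b", "c"]
bm = ["b", "d"]
merged = merge_candidates(vec, bm)
overlap = len(vec) + len(bm) - len(merged)  # IDs nominated by both retrievers
print(merged, overlap)  # → ['a', 'b', 'c', 'd'] 1
```

Because the vector list is iterated first, ties in membership resolve in favor of the vector ranking order; the cross-encoder then re-scores the union regardless of origin.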
125
query_topk_prompt_engine_v2.py
Normal file
@@ -0,0 +1,125 @@
# query_topk_prompt_engine_v2.py
# Run a query on a vector store
#
# This version uses an improved prompt that is more flexible and query-adaptive
# Based on query_topk_prompt_engine.py
#
# Implements a prompt and uses the build_exp.py vector store with BAAI/bge-large-en-v1.5
#
# E.M.F. January 2026

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
import os
import sys

#
# Globals
#
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Embedding model used in vector store (this should match the one in build_exp.py or equivalent)
# embed_model = HuggingFaceEmbedding(model_name="all-mpnet-base-v2")
embed_model = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)

# LLM model to use in query transform and generation
# command-r7b generates about as quickly as llama3.1:8B, but provides results that stick better
# to the provided context
llm = "command-r7b"
# Other models tried:
# llm = "llama3.1:8B"
# llm = "deepseek-r1:8B"
# llm = "gemma3:1b"

#
# Custom prompt for the query engine - Version 2 (improved)
#
# This prompt is more flexible and query-adaptive than v1:
# - Doesn't force artificial structure (exactly 10 files, mandatory theme)
# - Works for factual questions, exploratory queries, and comparisons
# - Emphasizes precision with explicit citations
# - Allows natural synthesis across sources
# - Honest about limitations when context is insufficient
#
PROMPT = PromptTemplate(
    """You are a precise research assistant analyzing excerpts from a document collection.

CONTEXT (ranked by relevance):
{context_str}

QUERY:
{query_str}

Instructions:
- Answer ONLY using information explicitly present in the CONTEXT above
- Prioritize higher-ranked excerpts but don't ignore lower ones if they contain unique relevant information
- Be specific: cite file names and quote/paraphrase key passages when relevant
- If the context doesn't contain enough information to answer fully, say so
- Synthesize information across multiple sources when appropriate

Your response should:
1. Directly answer the query using the context
2. Reference specific files and their content (e.g., "In <filename>, ...")
3. List all relevant source files at the end with brief relevance notes

If you find relevant information, organize it clearly. If the context is insufficient, explain what's missing."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate -- in this case using Ollama
    Settings.llm = Ollama(
        model=llm,  # First model tested
        request_timeout=360.0,
        context_window=8000,
    )

    # Load embedding model (same as used for vector store)
    Settings.embed_model = embed_model

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Build regular query engine with custom prompt
    query_engine = index.as_query_engine(
        similarity_top_k=15,  # pull wide
        # response_mode="compact",  # concise synthesis
        text_qa_template=PROMPT,  # custom prompt (v2)
        # node_postprocessors=[
        #     SimilarityPostprocessor(similarity_cutoff=0.75)  # keep strong hits; makes result count flexible
        # ],
    )

    # Query
    if len(sys.argv) < 2:
        print("Usage: python query_topk_prompt_engine_v2.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Generate the response by querying the engine
    # This performs the similarity search and then applies the prompt
    response = query_engine.query(q)

    # Return the query response and source documents
    print("\nResponse:\n")
    print(response.response)

    print("\nSource documents:")
    for node in response.source_nodes:
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        score_str = f"{score:.3f}" if score is not None else "n/a"
        print(f"{meta.get('file_name')} {meta.get('file_path')} {score_str}")


if __name__ == "__main__":
    main()
136
query_topk_prompt_engine_v3.py
Normal file
@@ -0,0 +1,136 @@
# query_topk_prompt_engine_v3.py
# Run a query on a vector store with cross-encoder re-ranking
#
# Based on v2. Adds a cross-encoder re-ranking step:
# 1. Retrieve top-30 chunks via vector similarity (bi-encoder, fast)
# 2. Re-rank to top-15 using a cross-encoder (slower but more accurate)
# 3. Pass re-ranked chunks to LLM for synthesis
#
# The cross-encoder scores each (query, chunk) pair jointly, which captures
# nuance that bi-encoder dot-product similarity misses.
#
# E.M.F. February 2026

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
from llama_index.core.postprocessor import SentenceTransformerRerank
import sys

#
# Globals
#

# Embedding model used in vector store (must match build_exp_claude.py)
EMBED_MODEL = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)

# LLM model for generation
llm = "command-r7b"

# Cross-encoder model for re-ranking (cached in ./models/)
# RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
# RERANK_MODEL = "cross-encoder/stsb-roberta-base"
# RERANK_MODEL = "BAAI/bge-reranker-v2-m3"
RERANK_TOP_N = 15    # keep top 15 after re-ranking
RETRIEVE_TOP_K = 30  # retrieve wider pool for re-ranker to work with

#
# Custom prompt for the query engine - Version 3
#
# Adapted for re-ranked context: every excerpt below has been scored for
# relevance by a cross-encoder, so even lower-ranked ones are worth examining.
# The prompt encourages the LLM to draw from all provided excerpts and to
# note what each distinct file contributes rather than collapsing onto one.
#
PROMPT = PromptTemplate(
    """You are a precise research assistant analyzing excerpts from a personal journal collection.
Every excerpt below has been selected and ranked for relevance to the query.

CONTEXT (ranked by relevance):
{context_str}

QUERY:
{query_str}

Instructions:
- Answer ONLY using information explicitly present in the CONTEXT above
- Examine ALL provided excerpts, not just the top few -- each one was selected for relevance
- Be specific: quote or closely paraphrase key passages and cite their file names
- When multiple files touch on the query, note what each one contributes
- If the context doesn't contain enough information to answer fully, say so

Your response should:
1. Directly answer the query, drawing on as many relevant excerpts as possible
2. Reference specific files and their content (e.g., "In <filename>, ...")
3. End with a list of all files that contributed to your answer, with a brief note on each

If the context is insufficient, explain what's missing."""
)

#
# Main program routine
#

def main():
    # Use a local model to generate -- in this case using Ollama
    Settings.llm = Ollama(
        model=llm,
        request_timeout=360.0,
        context_window=8000,
    )

    # Load embedding model (same as used for vector store)
    Settings.embed_model = EMBED_MODEL

    # Load persisted vector store + metadata
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Cross-encoder re-ranker
    reranker = SentenceTransformerRerank(
        model=RERANK_MODEL,
        top_n=RERANK_TOP_N,
    )

    # Build query engine: retrieve wide (top-30), re-rank to top-15, then synthesize
    query_engine = index.as_query_engine(
        similarity_top_k=RETRIEVE_TOP_K,
        text_qa_template=PROMPT,
        node_postprocessors=[reranker],
    )

    # Query
    if len(sys.argv) < 2:
        print("Usage: python query_topk_prompt_engine_v3.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Generate the response by querying the engine
    response = query_engine.query(q)

    # Return the query response and source documents
    print("\nResponse:\n")
    print(response.response)

    print("\nSource documents:")
    for node in response.source_nodes:
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        score_str = f"{score:.3f}" if score is not None else "n/a"
        print(f"{meta.get('file_name')} {meta.get('file_path')} {score_str}")


if __name__ == "__main__":
    main()
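The retrieve-wide-then-rerank funnel in v3 can be illustrated without any models: a cheap first-stage score nominates a wide pool, and a more expensive pairwise scorer reorders and truncates it. Both scoring functions below are toy stand-ins for the bi-encoder and cross-encoder, not real model calls:

```python
# Toy two-stage funnel: a cheap score retrieves a wide top-k pool, then an
# expensive pairwise score re-ranks it and keeps top-n. The `cheap` and
# `pair` lambdas are stand-ins for bi-encoder / cross-encoder scoring.

def retrieve_then_rerank(query, docs, cheap_score, pair_score, top_k=30, top_n=15):
    pool = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    return sorted(pool, key=lambda d: pair_score(query, d), reverse=True)[:top_n]

docs = ["sad journal entry", "trip report", "sad poem", "recipe"]
cheap = lambda q, d: sum(w in d for w in q.split())  # bag-of-words overlap
pair = lambda q, d: cheap(q, d) + ("journal" in d)   # "joint" scorer breaks ties
hits = retrieve_then_rerank("sad entries", docs, cheap, pair, top_k=3, top_n=2)
print(hits)  # → ['sad journal entry', 'sad poem']
```

The point of retrieving wider than the final cut (top-30 vs top-15 in v3) is that the second-stage scorer can promote candidates the first stage under-ranked, as the tie-break does here.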
171
requirements.txt
Normal file
@@ -0,0 +1,171 @@
aiohappyeyeballs==2.6.1
aiohttp==3.12.15
aiosignal==1.4.0
aiosqlite==0.21.0
annotated-types==0.7.0
anyio==4.10.0
appnope==0.1.4
argon2-cffi==25.1.0
argon2-cffi-bindings==25.1.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.5
attrs==25.3.0
babel==2.17.0
banks==2.2.0
beautifulsoup4==4.13.4
bleach==6.2.0
bm25s==0.2.14
certifi==2025.8.3
cffi==1.17.1
charset-normalizer==3.4.3
click==8.2.1
colorama==0.4.6
comm==0.2.3
contourpy==1.3.3
cycler==0.12.1
dataclasses-json==0.6.7
debugpy==1.8.16
decorator==5.2.1
defusedxml==0.7.1
Deprecated==1.2.18
dirtyjson==1.0.8
executing==2.2.0
fastjsonschema==2.21.1
filelock==3.18.0
filetype==1.2.0
fonttools==4.59.1
fqdn==1.5.1
frozenlist==1.7.0
fsspec==2025.7.0
greenlet==3.2.4
griffe==1.11.0
h11==0.16.0
hf-xet==1.1.7
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.34.4
idna==3.10
ipykernel==6.30.1
ipython==9.4.0
ipython_pygments_lexers==1.1.1
ipywidgets==8.1.7
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.6
joblib==1.5.1
json5==0.12.1
jsonpointer==3.0.0
jsonschema==4.25.0
jsonschema-specifications==2025.4.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.12.0
jupyter-lsp==2.2.6
jupyter_client==8.6.3
jupyter_core==5.8.1
jupyter_server==2.16.0
jupyter_server_terminals==0.5.3
jupyterlab==4.4.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.15
kiwisolver==1.4.9
lark==1.2.2
llama-index-core==0.13.1
llama-index-embeddings-huggingface==0.6.0
llama-index-instrumentation==0.4.0
llama-index-llms-ollama==0.7.0
llama-index-readers-file==0.5.0
llama-index-retrievers-bm25==0.6.5
llama-index-workflows==1.3.0
MarkupSafe==3.0.2
marshmallow==3.26.1
matplotlib==3.10.5
matplotlib-inline==0.1.7
mistune==3.1.3
mpmath==1.3.0
multidict==6.6.3
mypy_extensions==1.1.0
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.5
nltk==3.9.1
notebook==7.4.5
notebook_shim==0.2.4
numpy==2.3.2
ollama==0.5.3
overrides==7.7.0
packaging==25.0
pandas==2.2.3
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pillow==11.3.0
platformdirs==4.3.8
prometheus_client==0.22.1
prompt_toolkit==3.0.51
propcache==0.3.2
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pycparser==2.22
pydantic==2.11.7
pydantic_core==2.33.2
Pygments==2.19.2
pyparsing==3.2.3
pypdf==5.9.0
PyStemmer==2.2.0.3
python-dateutil==2.9.0.post0
python-json-logger==3.3.0
pytz==2025.2
PyYAML==6.0.2
pyzmq==27.0.1
referencing==0.36.2
regex==2025.7.34
requests==2.32.4
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rfc3987-syntax==1.1.0
rpds-py==0.27.0
safetensors==0.6.2
scikit-learn==1.7.1
scipy==1.16.1
seaborn==0.13.2
Send2Trash==1.8.3
sentence-transformers==5.1.0
setuptools==80.9.0
six==1.17.0
sniffio==1.3.1
soupsieve==2.7
SQLAlchemy==2.0.42
stack-data==0.6.3
striprtf==0.0.26
sympy==1.14.0
tenacity==9.1.2
terminado==0.18.1
threadpoolctl==3.6.0
tiktoken==0.11.0
tinycss2==1.4.0
tokenizers==0.21.4
torch==2.8.0
tornado==6.5.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.55.0
types-python-dateutil==2.9.0.20250809
typing-inspect==0.9.0
typing-inspection==0.4.1
typing_extensions==4.14.1
tzdata==2025.2
uri-template==1.3.0
urllib3==2.5.0
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
widgetsnbextension==4.0.14
wrapt==1.17.2
yarl==1.20.1
140
retrieve_hybrid_raw.py
Normal file
140
retrieve_hybrid_raw.py
Normal file
|
|
@ -0,0 +1,140 @@
# retrieve_hybrid_raw.py
# Hybrid verbatim chunk retrieval: BM25 + vector search + cross-encoder, no LLM.
#
# Same hybrid retrieval as query_hybrid_bm25_v4.py but outputs raw chunk text
# instead of LLM synthesis. Useful for inspecting what the hybrid pipeline
# retrieves and comparing against retrieve_raw.py (vector-only).
#
# Each chunk is annotated with its source (vector, BM25, or both) so you can
# see which retriever nominated it.
#
# E.M.F. February 2026

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.retrievers.bm25 import BM25Retriever
import sys
import textwrap

#
# Globals
#

# Embedding model (must match build_exp_claude.py)
EMBED_MODEL = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)

# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15

# Retrieval parameters
VECTOR_TOP_K = 20
BM25_TOP_K = 20

# Output formatting
WRAP_WIDTH = 80


def main():
    # No LLM needed -- set embed model only
    Settings.embed_model = EMBED_MODEL

    # Load persisted vector store
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # --- Retrievers ---

    vector_retriever = index.as_retriever(similarity_top_k=VECTOR_TOP_K)

    bm25_retriever = BM25Retriever.from_defaults(
        index=index,
        similarity_top_k=BM25_TOP_K,
    )

    # Cross-encoder re-ranker
    reranker = SentenceTransformerRerank(
        model=RERANK_MODEL,
        top_n=RERANK_TOP_N,
    )

    # Query
    if len(sys.argv) < 2:
        print("Usage: python retrieve_hybrid_raw.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Retrieve from both sources
    vector_nodes = vector_retriever.retrieve(q)
    bm25_nodes = bm25_retriever.retrieve(q)

    # Track which retriever found each node
    vector_ids = {n.node.node_id for n in vector_nodes}
    bm25_ids = {n.node.node_id for n in bm25_nodes}

    # Merge and deduplicate by node ID
    seen_ids = set()
    merged = []
    for node in vector_nodes + bm25_nodes:
        node_id = node.node.node_id
        if node_id not in seen_ids:
            seen_ids.add(node_id)
            merged.append(node)

    # Re-rank merged candidates
    reranked = reranker.postprocess_nodes(merged, query_str=q)

    # Retrieval stats
    n_both = len(vector_ids & bm25_ids)
    n_vector_only = len(vector_ids - bm25_ids)
    n_bm25_only = len(bm25_ids - vector_ids)

    print(f"\nQuery: {q}")
    print(f"Vector: {len(vector_nodes)}, BM25: {len(bm25_nodes)}, "
          f"overlap: {n_both}, merged: {len(merged)}, re-ranked to: {len(reranked)}")
    print(f"  vector-only: {n_vector_only}, bm25-only: {n_bm25_only}, both: {n_both}\n")

    # Output re-ranked chunks with source annotation
    for i, node in enumerate(reranked, 1):
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        file_name = meta.get("file_name", "unknown")
        text = node.get_content()
        node_id = node.node.node_id

        # Annotate source
        in_vector = node_id in vector_ids
        in_bm25 = node_id in bm25_ids
        if in_vector and in_bm25:
            source = "vector+bm25"
        elif in_bm25:
            source = "bm25-only"
        else:
            source = "vector-only"

        print("=" * WRAP_WIDTH)
        print(f"=== [{i}] {file_name} (score: {score:.3f}) [{source}]")
        print("=" * WRAP_WIDTH)
        for line in text.splitlines():
            if line.strip():
                print(textwrap.fill(line, width=WRAP_WIDTH))
            else:
                print()
        print()


if __name__ == "__main__":
    main()
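The merge-and-deduplicate logic above (first occurrence of each node ID wins, with a source tag recording which retriever nominated it) can be sketched in isolation. This is a minimal sketch, not code from the repo: the `(node_id, score)` tuples stand in for llama_index `NodeWithScore` objects, and `merge_candidates` is a hypothetical helper name.

```python
# Minimal sketch of the hybrid merge: combine two ranked candidate lists,
# keep the first occurrence of each ID, and tag each kept hit with the
# retriever(s) that found it. (node_id, score) pairs stand in for
# llama_index NodeWithScore objects.

def merge_candidates(vector_hits, bm25_hits):
    vector_ids = {nid for nid, _ in vector_hits}
    bm25_ids = {nid for nid, _ in bm25_hits}
    seen, merged = set(), []
    for nid, score in vector_hits + bm25_hits:
        if nid in seen:
            continue  # deduplicate: vector ranking wins on ties
        seen.add(nid)
        if nid in vector_ids and nid in bm25_ids:
            source = "vector+bm25"
        elif nid in bm25_ids:
            source = "bm25-only"
        else:
            source = "vector-only"
        merged.append((nid, score, source))
    return merged

hits = merge_candidates([("a", 0.9), ("b", 0.7)], [("b", 3.1), ("c", 2.4)])
print(hits)
# [('a', 0.9, 'vector-only'), ('b', 0.7, 'vector+bm25'), ('c', 2.4, 'bm25-only')]
```

Note the score scales are incommensurate (cosine similarity vs. BM25), which is why the script defers ranking to the cross-encoder instead of comparing raw scores.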
97
retrieve_raw.py
Normal file

@@ -0,0 +1,97 @@
# retrieve_raw.py
# Verbatim chunk retrieval: vector search + cross-encoder re-ranking, no LLM.
#
# Returns the top re-ranked chunks with their full text, file metadata, and
# scores. Useful for browsing source material directly and verifying what
# the RAG pipeline retrieves before LLM synthesis.
#
# Uses the same vector store, embedding model, and re-ranker as
# query_topk_prompt_engine_v3.py, but skips the LLM step entirely.
#
# E.M.F. February 2026

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.postprocessor import SentenceTransformerRerank
import sys
import textwrap

#
# Globals
#

# Embedding model (must match build_exp_claude.py)
EMBED_MODEL = HuggingFaceEmbedding(cache_folder="./models", model_name="BAAI/bge-large-en-v1.5", local_files_only=True)

# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15
RETRIEVE_TOP_K = 30

# Output formatting
WRAP_WIDTH = 80


def main():
    # No LLM needed -- set embed model only
    Settings.embed_model = EMBED_MODEL

    # Load persisted vector store
    storage_context = StorageContext.from_defaults(persist_dir="./storage_exp")
    index = load_index_from_storage(storage_context)

    # Build retriever (vector search only, no query engine / LLM)
    retriever = index.as_retriever(similarity_top_k=RETRIEVE_TOP_K)

    # Cross-encoder re-ranker
    reranker = SentenceTransformerRerank(
        model=RERANK_MODEL,
        top_n=RERANK_TOP_N,
    )

    # Query
    if len(sys.argv) < 2:
        print("Usage: python retrieve_raw.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Retrieve and re-rank
    nodes = retriever.retrieve(q)
    reranked = reranker.postprocess_nodes(nodes, query_str=q)

    # Output
    print(f"\nQuery: {q}")
    print(f"Retrieved {len(nodes)} chunks, re-ranked to top {len(reranked)}\n")

    for i, node in enumerate(reranked, 1):
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        file_name = meta.get("file_name", "unknown")
        text = node.get_content()

        print("=" * WRAP_WIDTH)
        print(f"=== [{i}] {file_name} (score: {score:.3f})")
        print("=" * WRAP_WIDTH)
        # Wrap text for readability
        for line in text.splitlines():
            if line.strip():
                print(textwrap.fill(line, width=WRAP_WIDTH))
            else:
                print()
        print()


if __name__ == "__main__":
    main()
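Both scripts print chunks by wrapping each line independently with `textwrap.fill`, so blank lines (paragraph breaks in the journal entries) survive wrapping. A minimal sketch of that behaviour; `wrap_chunk` is a hypothetical helper, not part of the repo:

```python
import textwrap

WRAP_WIDTH = 80

def wrap_chunk(text: str, width: int = WRAP_WIDTH) -> str:
    # Wrap each line on its own so blank lines (paragraph breaks) are
    # preserved, mirroring the per-line printing loop in retrieve_raw.py.
    wrapped = []
    for line in text.splitlines():
        wrapped.append(textwrap.fill(line, width=width) if line.strip() else "")
    return "\n".join(wrapped)

print(wrap_chunk("one two three four five six seven eight nine ten\n\nshort", width=20))
```

Wrapping per line (rather than re-flowing the whole chunk) keeps the journal's original paragraph structure visible in the output.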
30
run_query.sh
Executable file

@@ -0,0 +1,30 @@
#!/bin/bash
# This shell script handles I/O for the python query engine.
# It takes a query and returns the formatted results.

# E.M.F. August 2025

# Usage: ./run_query.sh

QUERY_SCRIPT="query_hybrid_bm25_v4.py"

echo -e "Current query engine is $QUERY_SCRIPT\n"

# Loop until input is "exit"
while true; do
    read -p "Enter your query (or type 'exit' to quit): " query
    if [ "$query" == "exit" ] || [ "$query" == "quit" ] || [ "$query" == "" ]; then
        echo "Exiting..."
        break
    fi
    time_start=$(date +%s)

    # Call the python script with the query and format the output
    python3 "$QUERY_SCRIPT" --query "$query" | \
        expand | sed -E 's|(.* )(.*/data)|\1./data|' | fold -s -w 131

    time_end=$(date +%s)
    elapsed=$((time_end - time_start))
    echo -e "Query processed in $elapsed seconds.\n"
    echo "$query" >> query.log
done
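The `sed -E 's|(.* )(.*/data)|\1./data|'` stage above shortens absolute paths ending in `/data` to the relative `./data` symlink before the output is folded. The same greedy-match behaviour can be reproduced with Python's `re.sub`; this is a sketch for illustration only, and the absolute path below is a made-up example, not a path from the repo:

```python
import re

def shorten_data_path(line: str) -> str:
    # Equivalent of sed -E 's|(.* )(.*/data)|\1./data|':
    # greedily match up to the last space, then rewrite any trailing
    # absolute prefix ending in /data to the relative ./data symlink.
    return re.sub(r"(.* )(.*/data)", r"\1./data", line, count=1)

print(shorten_data_path("Source: /Users/emf/journal/data"))
# Source: ./data
```

Because `(.* )` is greedy, only the final space-delimited token is considered for rewriting, which matches how the script's output places the file path at the end of a line.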
973
sandbox.ipynb
Normal file

@@ -0,0 +1,973 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "11d5ae50",
   "metadata": {},
   "source": [
    "# llamaindex sandbox\n",
    "\n",
    "Using this to explore llamaindex\\\n",
    "August 2025"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "813f8b1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import llama_index.core"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "656faffb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['BaseCallbackHandler', 'BasePromptTemplate', 'Callable', 'ChatPromptTemplate', 'ComposableGraph', 'Document', 'DocumentSummaryIndex', 'GPTDocumentSummaryIndex', 'GPTKeywordTableIndex', 'GPTListIndex', 'GPTRAKEKeywordTableIndex', 'GPTSimpleKeywordTableIndex', 'GPTTreeIndex', 'GPTVectorStoreIndex', 'IndexStructType', 'KeywordTableIndex', 'KnowledgeGraphIndex', 'ListIndex', 'MockEmbedding', 'NullHandler', 'Optional', 'Prompt', 'PromptHelper', 'PromptTemplate', 'PropertyGraphIndex', 'QueryBundle', 'RAKEKeywordTableIndex', 'Response', 'SQLContextBuilder', 'SQLDatabase', 'SQLDocumentContextBuilder', 'SelectorPromptTemplate', 'ServiceContext', 'Settings', 'SimpleDirectoryReader', 'SimpleKeywordTableIndex', 'StorageContext', 'SummaryIndex', 'TreeIndex', 'VectorStoreIndex', '__all__', '__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'async_utils', 'base', 'bridge', 'callbacks', 'chat_engine', 'constants', 'data_structs', 'download', 'download_loader', 'embeddings', 'evaluation', 'get_response_synthesizer', 'get_tokenizer', 'global_handler', 'global_tokenizer', 'graph_stores', 'image_retriever', 'indices', 'ingestion', 'instrumentation', 'llama_dataset', 'llms', 'load_graph_from_storage', 'load_index_from_storage', 'load_indices_from_storage', 'logging', 'memory', 'multi_modal_llms', 'node_parser', 'objects', 'output_parsers', 'postprocessor', 'prompts', 'query_engine', 'question_gen', 'readers', 'response', 'response_synthesizers', 'schema', 'selectors', 'service_context', 'set_global_handler', 'set_global_service_context', 'set_global_tokenizer', 'settings', 'storage', 'tools', 'types', 'utilities', 'utils', 'vector_stores', 'workflow']\n"
     ]
    }
   ],
   "source": [
    "# List available objects\n",
    "print(dir(llama_index.core))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "bea0759d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BaseCallbackHandler\n",
      "BasePromptTemplate\n",
      "Callable\n",
      "ChatPromptTemplate\n",
      "ComposableGraph\n",
      "Document\n",
      "DocumentSummaryIndex\n",
      "GPTDocumentSummaryIndex\n",
      "GPTKeywordTableIndex\n",
      "GPTListIndex\n",
      "GPTRAKEKeywordTableIndex\n",
      "GPTSimpleKeywordTableIndex\n",
      "GPTTreeIndex\n",
      "GPTVectorStoreIndex\n",
      "IndexStructType\n",
      "KeywordTableIndex\n",
      "KnowledgeGraphIndex\n",
      "ListIndex\n",
      "MockEmbedding\n",
      "NullHandler\n",
      "Optional\n",
      "Prompt\n",
      "PromptHelper\n",
      "PromptTemplate\n",
      "PropertyGraphIndex\n",
      "QueryBundle\n",
      "RAKEKeywordTableIndex\n",
      "Response\n",
      "SQLContextBuilder\n",
      "SQLDatabase\n",
      "SQLDocumentContextBuilder\n",
      "SelectorPromptTemplate\n",
      "ServiceContext\n",
      "Settings\n",
      "SimpleDirectoryReader\n",
      "SimpleKeywordTableIndex\n",
      "StorageContext\n",
      "SummaryIndex\n",
      "TreeIndex\n",
      "VectorStoreIndex\n",
      "__all__\n",
      "__annotations__\n",
      "__builtins__\n",
      "__cached__\n",
      "__doc__\n",
      "__file__\n",
      "__loader__\n",
      "__name__\n",
      "__package__\n",
      "__path__\n",
      "__spec__\n",
      "__version__\n",
      "async_utils\n",
      "base\n",
      "bridge\n",
      "callbacks\n",
      "chat_engine\n",
      "constants\n",
      "data_structs\n",
      "download\n",
      "download_loader\n",
      "embeddings\n",
      "evaluation\n",
      "get_response_synthesizer\n",
      "get_tokenizer\n",
      "global_handler\n",
      "global_tokenizer\n",
      "graph_stores\n",
      "image_retriever\n",
      "indices\n",
      "ingestion\n",
      "instrumentation\n",
      "llama_dataset\n",
      "llms\n",
      "load_graph_from_storage\n",
      "load_index_from_storage\n",
      "load_indices_from_storage\n",
      "logging\n",
      "memory\n",
      "multi_modal_llms\n",
      "node_parser\n",
      "objects\n",
      "output_parsers\n",
      "postprocessor\n",
      "prompts\n",
      "query_engine\n",
      "question_gen\n",
      "readers\n",
      "response\n",
      "response_synthesizers\n",
      "schema\n",
      "selectors\n",
      "service_context\n",
      "set_global_handler\n",
      "set_global_service_context\n",
      "set_global_tokenizer\n",
      "settings\n",
      "storage\n",
      "tools\n",
      "types\n",
      "utilities\n",
      "utils\n",
      "vector_stores\n",
      "workflow\n"
     ]
    }
   ],
   "source": [
    "# Better formatted output for list of available objects\n",
    "objects = dir(llama_index.core)\n",
    "for obj in objects:\n",
    "    print(obj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "3886a5f0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "list"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# dir returns a list\n",
    "type(objects)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "272cb0c9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "104"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# In the case of llamaindex.core, it contains 104 objects\n",
    "\n",
    "len(objects)"
   ]
  },
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "bfffc03f",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"Help on class VectorStoreIndex in module llama_index.core.indices.vector_store.base:\n",
|
||||||
|
"\n",
|
||||||
|
"class VectorStoreIndex(llama_index.core.indices.base.BaseIndex)\n",
|
||||||
|
" | VectorStoreIndex(nodes: Optional[Sequence[llama_index.core.schema.BaseNode]] = None, use_async: bool = False, store_nodes_override: bool = False, embed_model: Union[llama_index.core.base.embeddings.base.BaseEmbedding, ForwardRef('LCEmbeddings'), str, NoneType] = None, insert_batch_size: int = 2048, objects: Optional[Sequence[llama_index.core.schema.IndexNode]] = None, index_struct: Optional[llama_index.core.data_structs.data_structs.IndexDict] = None, storage_context: Optional[llama_index.core.storage.storage_context.StorageContext] = None, callback_manager: Optional[llama_index.core.callbacks.base.CallbackManager] = None, transformations: Optional[List[llama_index.core.schema.TransformComponent]] = None, show_progress: bool = False, **kwargs: Any) -> None\n",
|
||||||
|
" |\n",
|
||||||
|
" | Vector Store Index.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Args:\n",
|
||||||
|
" | use_async (bool): Whether to use asynchronous calls. Defaults to False.\n",
|
||||||
|
" | show_progress (bool): Whether to show tqdm progress bars. Defaults to False.\n",
|
||||||
|
" | store_nodes_override (bool): set to True to always store Node objects in index\n",
|
||||||
|
" | store and document store even if vector store keeps text. Defaults to False\n",
|
||||||
|
" |\n",
|
||||||
|
" | Method resolution order:\n",
|
||||||
|
" | VectorStoreIndex\n",
|
||||||
|
" | llama_index.core.indices.base.BaseIndex\n",
|
||||||
|
" | typing.Generic\n",
|
||||||
|
" | abc.ABC\n",
|
||||||
|
" | builtins.object\n",
|
||||||
|
" |\n",
|
||||||
|
" | Methods defined here:\n",
|
||||||
|
" |\n",
|
||||||
|
" | __init__(self, nodes: Optional[Sequence[llama_index.core.schema.BaseNode]] = None, use_async: bool = False, store_nodes_override: bool = False, embed_model: Union[llama_index.core.base.embeddings.base.BaseEmbedding, ForwardRef('LCEmbeddings'), str, NoneType] = None, insert_batch_size: int = 2048, objects: Optional[Sequence[llama_index.core.schema.IndexNode]] = None, index_struct: Optional[llama_index.core.data_structs.data_structs.IndexDict] = None, storage_context: Optional[llama_index.core.storage.storage_context.StorageContext] = None, callback_manager: Optional[llama_index.core.callbacks.base.CallbackManager] = None, transformations: Optional[List[llama_index.core.schema.TransformComponent]] = None, show_progress: bool = False, **kwargs: Any) -> None\n",
|
||||||
|
" | Initialize params.\n",
|
||||||
|
" |\n",
|
||||||
|
" | async adelete_nodes(self, node_ids: List[str], delete_from_docstore: bool = False, **delete_kwargs: Any) -> None\n",
|
||||||
|
" | Delete a list of nodes from the index.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Args:\n",
|
||||||
|
" | node_ids (List[str]): A list of node_ids from the nodes to delete\n",
|
||||||
|
" |\n",
|
||||||
|
" | async adelete_ref_doc(self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any) -> None\n",
|
||||||
|
" | Delete a document and it's nodes by using ref_doc_id.\n",
|
||||||
|
" |\n",
|
||||||
|
" | async ainsert_nodes(self, nodes: Sequence[llama_index.core.schema.BaseNode], **insert_kwargs: Any) -> None\n",
|
||||||
|
" | Insert nodes.\n",
|
||||||
|
" |\n",
|
||||||
|
" | NOTE: overrides BaseIndex.ainsert_nodes.\n",
|
||||||
|
" | VectorStoreIndex only stores nodes in document store\n",
|
||||||
|
" | if vector store does not store text\n",
|
||||||
|
" |\n",
|
||||||
|
" | as_retriever(self, **kwargs: Any) -> llama_index.core.base.base_retriever.BaseRetriever\n",
|
||||||
|
" |\n",
|
||||||
|
" | build_index_from_nodes(self, nodes: Sequence[llama_index.core.schema.BaseNode], **insert_kwargs: Any) -> llama_index.core.data_structs.data_structs.IndexDict\n",
|
||||||
|
" | Build the index from nodes.\n",
|
||||||
|
" |\n",
|
||||||
|
" | NOTE: Overrides BaseIndex.build_index_from_nodes.\n",
|
||||||
|
" | VectorStoreIndex only stores nodes in document store\n",
|
||||||
|
" | if vector store does not store text\n",
|
||||||
|
" |\n",
|
||||||
|
" | delete_nodes(self, node_ids: List[str], delete_from_docstore: bool = False, **delete_kwargs: Any) -> None\n",
|
||||||
|
" | Delete a list of nodes from the index.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Args:\n",
|
||||||
|
" | node_ids (List[str]): A list of node_ids from the nodes to delete\n",
|
||||||
|
" |\n",
|
||||||
|
" | delete_ref_doc(self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any) -> None\n",
|
||||||
|
" | Delete a document and it's nodes by using ref_doc_id.\n",
|
||||||
|
" |\n",
|
||||||
|
" | insert_nodes(self, nodes: Sequence[llama_index.core.schema.BaseNode], **insert_kwargs: Any) -> None\n",
|
||||||
|
" | Insert nodes.\n",
|
||||||
|
" |\n",
|
||||||
|
" | NOTE: overrides BaseIndex.insert_nodes.\n",
|
||||||
|
" | VectorStoreIndex only stores nodes in document store\n",
|
||||||
|
" | if vector store does not store text\n",
|
||||||
|
" |\n",
|
||||||
|
" | ----------------------------------------------------------------------\n",
|
||||||
|
" | Class methods defined here:\n",
|
||||||
|
" |\n",
|
||||||
|
" | from_vector_store(vector_store: llama_index.core.vector_stores.types.BasePydanticVectorStore, embed_model: Union[llama_index.core.base.embeddings.base.BaseEmbedding, ForwardRef('LCEmbeddings'), str, NoneType] = None, **kwargs: Any) -> 'VectorStoreIndex'\n",
|
||||||
|
" |\n",
|
||||||
|
" | ----------------------------------------------------------------------\n",
|
||||||
|
" | Readonly properties defined here:\n",
|
||||||
|
" |\n",
|
||||||
|
" | ref_doc_info\n",
|
||||||
|
" | Retrieve a dict mapping of ingested documents and their nodes+metadata.\n",
|
||||||
|
" |\n",
|
||||||
|
" | vector_store\n",
|
||||||
|
" |\n",
|
||||||
|
" | ----------------------------------------------------------------------\n",
|
||||||
|
" | Data and other attributes defined here:\n",
|
||||||
|
" |\n",
|
||||||
|
" | __abstractmethods__ = frozenset()\n",
|
||||||
|
" |\n",
|
||||||
|
" | __annotations__ = {}\n",
|
||||||
|
" |\n",
|
||||||
|
" | __orig_bases__ = (llama_index.core.indices.base.BaseIndex[llama_index....\n",
|
||||||
|
" |\n",
|
||||||
|
" | __parameters__ = ()\n",
|
||||||
|
" |\n",
|
||||||
|
" | index_struct_cls = <class 'llama_index.core.data_structs.data_structs....\n",
|
||||||
|
" | A simple dictionary of documents.\n",
|
||||||
|
" |\n",
|
||||||
|
" |\n",
|
||||||
|
" | ----------------------------------------------------------------------\n",
|
||||||
|
" | Methods inherited from llama_index.core.indices.base.BaseIndex:\n",
|
||||||
|
" |\n",
|
||||||
|
" | async ainsert(self, document: llama_index.core.schema.Document, **insert_kwargs: Any) -> None\n",
|
||||||
|
" | Asynchronously insert a document.\n",
|
||||||
|
" |\n",
|
||||||
|
" | async arefresh_ref_docs(self, documents: Sequence[llama_index.core.schema.Document], **update_kwargs: Any) -> List[bool]\n",
|
||||||
|
" | Asynchronously refresh an index with documents that have changed.\n",
|
||||||
|
" |\n",
|
||||||
|
" | This allows users to save LLM and Embedding model calls, while only\n",
|
||||||
|
" | updating documents that have any changes in text or metadata. It\n",
|
||||||
|
" | will also insert any documents that previously were not stored.\n",
|
||||||
|
" |\n",
|
||||||
|
" | as_chat_engine(self, chat_mode: llama_index.core.chat_engine.types.ChatMode = <ChatMode.BEST: 'best'>, llm: Union[str, llama_index.core.llms.llm.LLM, ForwardRef('BaseLanguageModel'), NoneType] = None, **kwargs: Any) -> llama_index.core.chat_engine.types.BaseChatEngine\n",
|
||||||
|
" | Convert the index to a chat engine.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Calls `index.as_query_engine(llm=llm, **kwargs)` to get the query engine and then\n",
|
||||||
|
" | wraps it in a chat engine based on the chat mode.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Chat modes:\n",
|
||||||
|
" | - `ChatMode.BEST` (default): Chat engine that uses an agent (react or openai) with a query engine tool\n",
|
||||||
|
" | - `ChatMode.CONTEXT`: Chat engine that uses a retriever to get context\n",
|
||||||
|
" | - `ChatMode.CONDENSE_QUESTION`: Chat engine that condenses questions\n",
|
||||||
|
" | - `ChatMode.CONDENSE_PLUS_CONTEXT`: Chat engine that condenses questions and uses a retriever to get context\n",
|
||||||
|
" | - `ChatMode.SIMPLE`: Simple chat engine that uses the LLM directly\n",
|
||||||
|
" | - `ChatMode.REACT`: Chat engine that uses a react agent with a query engine tool\n",
|
||||||
|
" | - `ChatMode.OPENAI`: Chat engine that uses an openai agent with a query engine tool\n",
|
||||||
|
" |\n",
|
||||||
|
" | as_query_engine(self, llm: Union[str, llama_index.core.llms.llm.LLM, ForwardRef('BaseLanguageModel'), NoneType] = None, **kwargs: Any) -> llama_index.core.base.base_query_engine.BaseQueryEngine\n",
|
||||||
|
" | Convert the index to a query engine.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Calls `index.as_retriever(**kwargs)` to get the retriever and then wraps it in a\n",
|
||||||
|
" | `RetrieverQueryEngine.from_args(retriever, **kwrags)` call.\n",
|
||||||
|
" |\n",
|
||||||
|
" | async aupdate_ref_doc(self, document: llama_index.core.schema.Document, **update_kwargs: Any) -> None\n",
|
||||||
|
" | Asynchronously update a document and it's corresponding nodes.\n",
|
||||||
|
" |\n",
|
||||||
|
" | This is equivalent to deleting the document and then inserting it again.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Args:\n",
|
||||||
|
" | document (Union[BaseDocument, BaseIndex]): document to update\n",
|
||||||
|
" | insert_kwargs (Dict): kwargs to pass to insert\n",
|
||||||
|
" | delete_kwargs (Dict): kwargs to pass to delete\n",
|
||||||
|
" |\n",
|
||||||
|
" | delete(self, doc_id: str, **delete_kwargs: Any) -> None\n",
|
||||||
|
" | Delete a document from the index.\n",
|
||||||
|
" | All nodes in the index related to the index will be deleted.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Args:\n",
|
||||||
|
" | doc_id (str): A doc_id of the ingested document\n",
|
||||||
|
" |\n",
|
||||||
|
" | insert(self, document: llama_index.core.schema.Document, **insert_kwargs: Any) -> None\n",
|
||||||
|
" | Insert a document.\n",
|
||||||
|
" |\n",
|
||||||
|
" | refresh(self, documents: Sequence[llama_index.core.schema.Document], **update_kwargs: Any) -> List[bool]\n",
|
||||||
|
" | Refresh an index with documents that have changed.\n",
|
||||||
|
" |\n",
|
||||||
|
" | This allows users to save LLM and Embedding model calls, while only\n",
|
||||||
|
" | updating documents that have any changes in text or metadata. It\n",
|
||||||
|
" | will also insert any documents that previously were not stored.\n",
|
||||||
|
" |\n",
|
||||||
|
" | refresh_ref_docs(self, documents: Sequence[llama_index.core.schema.Document], **update_kwargs: Any) -> List[bool]\n",
|
||||||
|
" | Refresh an index with documents that have changed.\n",
|
||||||
|
" |\n",
|
||||||
|
" | This allows users to save LLM and Embedding model calls, while only\n",
|
||||||
|
" | updating documents that have any changes in text or metadata. It\n",
|
||||||
|
" | will also insert any documents that previously were not stored.\n",
|
||||||
|
" |\n",
|
||||||
|
" | set_index_id(self, index_id: str) -> None\n",
|
||||||
|
" | Set the index id.\n",
|
||||||
|
" |\n",
|
||||||
|
" | NOTE: if you decide to set the index_id on the index_struct manually,\n",
|
||||||
|
" | you will need to explicitly call `add_index_struct` on the `index_store`\n",
|
||||||
|
" | to update the index store.\n",
|
||||||
|
" |\n",
|
||||||
|
" | Args:\n",
|
||||||
|
" | index_id (str): Index id to set.\n",
|
||||||
|
" |\n",
|
||||||
|
" | update(self, document: llama_index.core.schema.Document, **update_kwargs: Any) -> None\n",
|
||||||
|
" | Update a document and it's corresponding nodes.\n",
|
||||||
|
" |\n",
|
||||||
|
" | This is equivalent to deleting the document and then inserting it again.\n",
|
||||||
|
" |\n",
    " |      Args:\n",
    " |          document (Union[BaseDocument, BaseIndex]): document to update\n",
    " |          insert_kwargs (Dict): kwargs to pass to insert\n",
    " |          delete_kwargs (Dict): kwargs to pass to delete\n",
    " |\n",
    " |  update_ref_doc(self, document: llama_index.core.schema.Document, **update_kwargs: Any) -> None\n",
    " |      Update a document and it's corresponding nodes.\n",
    " |\n",
    " |      This is equivalent to deleting the document and then inserting it again.\n",
    " |\n",
    " |      Args:\n",
    " |          document (Union[BaseDocument, BaseIndex]): document to update\n",
    " |          insert_kwargs (Dict): kwargs to pass to insert\n",
    " |          delete_kwargs (Dict): kwargs to pass to delete\n",
    " |\n",
    " |  ----------------------------------------------------------------------\n",
    " |  Class methods inherited from llama_index.core.indices.base.BaseIndex:\n",
    " |\n",
    " |  from_documents(documents: Sequence[llama_index.core.schema.Document], storage_context: Optional[llama_index.core.storage.storage_context.StorageContext] = None, show_progress: bool = False, callback_manager: Optional[llama_index.core.callbacks.base.CallbackManager] = None, transformations: Optional[List[llama_index.core.schema.TransformComponent]] = None, **kwargs: Any) -> ~IndexType\n",
    " |      Create index from documents.\n",
    " |\n",
    " |      Args:\n",
    " |          documents (Sequence[Document]]): List of documents to\n",
    " |              build the index from.\n",
    " |\n",
    " |  ----------------------------------------------------------------------\n",
    " |  Readonly properties inherited from llama_index.core.indices.base.BaseIndex:\n",
    " |\n",
    " |  docstore\n",
    " |      Get the docstore corresponding to the index.\n",
    " |\n",
    " |  index_id\n",
    " |      Get the index struct.\n",
    " |\n",
    " |  index_struct\n",
    " |      Get the index struct.\n",
    " |\n",
    " |  storage_context\n",
    " |\n",
    " |  ----------------------------------------------------------------------\n",
    " |  Data descriptors inherited from llama_index.core.indices.base.BaseIndex:\n",
    " |\n",
    " |  __dict__\n",
    " |      dictionary for instance variables\n",
    " |\n",
    " |  __weakref__\n",
    " |      list of weak references to the object\n",
    " |\n",
    " |  summary\n",
    " |\n",
    " |  ----------------------------------------------------------------------\n",
    " |  Class methods inherited from typing.Generic:\n",
    " |\n",
    " |  __class_getitem__(...)\n",
    " |      Parameterizes a generic class.\n",
    " |\n",
    " |      At least, parameterizing a generic class is the *main* thing this\n",
    " |      method does. For example, for some generic class `Foo`, this is called\n",
    " |      when we do `Foo[int]` - there, with `cls=Foo` and `params=int`.\n",
    " |\n",
    " |      However, note that this method is also called when defining generic\n",
    " |      classes in the first place with `class Foo[T]: ...`.\n",
    " |\n",
    " |  __init_subclass__(...)\n",
    " |      Function to initialize subclasses.\n",
    "\n"
   ]
  }
 ],
 "source": [
  "# Get help on a specific object\n",
  "help(llama_index.core.VectorStoreIndex)\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": 7,
 "id": "3eb5f1b7",
 "metadata": {},
 "outputs": [
  {
   "name": "stdout",
   "output_type": "stream",
   "text": [
    "class VectorStoreIndex(BaseIndex[IndexDict]):\n",
    "    \"\"\"\n",
    "    Vector Store Index.\n",
    "\n",
    "    Args:\n",
    "        use_async (bool): Whether to use asynchronous calls. Defaults to False.\n",
    "        show_progress (bool): Whether to show tqdm progress bars. Defaults to False.\n",
    "        store_nodes_override (bool): set to True to always store Node objects in index\n",
    "            store and document store even if vector store keeps text. Defaults to False\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    index_struct_cls = IndexDict\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        nodes: Optional[Sequence[BaseNode]] = None,\n",
    "        # vector store index params\n",
    "        use_async: bool = False,\n",
    "        store_nodes_override: bool = False,\n",
    "        embed_model: Optional[EmbedType] = None,\n",
    "        insert_batch_size: int = 2048,\n",
    "        # parent class params\n",
    "        objects: Optional[Sequence[IndexNode]] = None,\n",
    "        index_struct: Optional[IndexDict] = None,\n",
    "        storage_context: Optional[StorageContext] = None,\n",
    "        callback_manager: Optional[CallbackManager] = None,\n",
    "        transformations: Optional[List[TransformComponent]] = None,\n",
    "        show_progress: bool = False,\n",
    "        **kwargs: Any,\n",
    "    ) -> None:\n",
    "        \"\"\"Initialize params.\"\"\"\n",
    "        self._use_async = use_async\n",
    "        self._store_nodes_override = store_nodes_override\n",
    "        self._embed_model = resolve_embed_model(\n",
    "            embed_model or Settings.embed_model, callback_manager=callback_manager\n",
    "        )\n",
    "\n",
    "        self._insert_batch_size = insert_batch_size\n",
    "        super().__init__(\n",
    "            nodes=nodes,\n",
    "            index_struct=index_struct,\n",
    "            storage_context=storage_context,\n",
    "            show_progress=show_progress,\n",
    "            objects=objects,\n",
    "            callback_manager=callback_manager,\n",
    "            transformations=transformations,\n",
    "            **kwargs,\n",
    "        )\n",
    "\n",
    "    @classmethod\n",
    "    def from_vector_store(\n",
    "        cls,\n",
    "        vector_store: BasePydanticVectorStore,\n",
    "        embed_model: Optional[EmbedType] = None,\n",
    "        **kwargs: Any,\n",
    "    ) -> \"VectorStoreIndex\":\n",
    "        if not vector_store.stores_text:\n",
    "            raise ValueError(\n",
    "                \"Cannot initialize from a vector store that does not store text.\"\n",
    "            )\n",
    "\n",
    "        kwargs.pop(\"storage_context\", None)\n",
    "        storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "\n",
    "        return cls(\n",
    "            nodes=[],\n",
    "            embed_model=embed_model,\n",
    "            storage_context=storage_context,\n",
    "            **kwargs,\n",
    "        )\n",
    "\n",
    "    @property\n",
    "    def vector_store(self) -> BasePydanticVectorStore:\n",
    "        return self._vector_store\n",
    "\n",
    "    def as_retriever(self, **kwargs: Any) -> BaseRetriever:\n",
    "        # NOTE: lazy import\n",
    "        from llama_index.core.indices.vector_store.retrievers import (\n",
    "            VectorIndexRetriever,\n",
    "        )\n",
    "\n",
    "        return VectorIndexRetriever(\n",
    "            self,\n",
    "            node_ids=list(self.index_struct.nodes_dict.values()),\n",
    "            callback_manager=self._callback_manager,\n",
    "            object_map=self._object_map,\n",
    "            **kwargs,\n",
    "        )\n",
    "\n",
    "    def _get_node_with_embedding(\n",
    "        self,\n",
    "        nodes: Sequence[BaseNode],\n",
    "        show_progress: bool = False,\n",
    "    ) -> List[BaseNode]:\n",
    "        \"\"\"\n",
    "        Get tuples of id, node, and embedding.\n",
    "\n",
    "        Allows us to store these nodes in a vector store.\n",
    "        Embeddings are called in batches.\n",
    "\n",
    "        \"\"\"\n",
    "        id_to_embed_map = embed_nodes(\n",
    "            nodes, self._embed_model, show_progress=show_progress\n",
    "        )\n",
    "\n",
    "        results = []\n",
    "        for node in nodes:\n",
    "            embedding = id_to_embed_map[node.node_id]\n",
    "            result = node.model_copy()\n",
    "            result.embedding = embedding\n",
    "            results.append(result)\n",
    "        return results\n",
    "\n",
    "    async def _aget_node_with_embedding(\n",
    "        self,\n",
    "        nodes: Sequence[BaseNode],\n",
    "        show_progress: bool = False,\n",
    "    ) -> List[BaseNode]:\n",
    "        \"\"\"\n",
    "        Asynchronously get tuples of id, node, and embedding.\n",
    "\n",
    "        Allows us to store these nodes in a vector store.\n",
    "        Embeddings are called in batches.\n",
    "\n",
    "        \"\"\"\n",
    "        id_to_embed_map = await async_embed_nodes(\n",
    "            nodes=nodes,\n",
    "            embed_model=self._embed_model,\n",
    "            show_progress=show_progress,\n",
    "        )\n",
    "\n",
    "        results = []\n",
    "        for node in nodes:\n",
    "            embedding = id_to_embed_map[node.node_id]\n",
    "            result = node.model_copy()\n",
    "            result.embedding = embedding\n",
    "            results.append(result)\n",
    "        return results\n",
    "\n",
    "    async def _async_add_nodes_to_index(\n",
    "        self,\n",
    "        index_struct: IndexDict,\n",
    "        nodes: Sequence[BaseNode],\n",
    "        show_progress: bool = False,\n",
    "        **insert_kwargs: Any,\n",
    "    ) -> None:\n",
    "        \"\"\"Asynchronously add nodes to index.\"\"\"\n",
    "        if not nodes:\n",
    "            return\n",
    "\n",
    "        for nodes_batch in iter_batch(nodes, self._insert_batch_size):\n",
    "            nodes_batch = await self._aget_node_with_embedding(\n",
    "                nodes_batch, show_progress\n",
    "            )\n",
    "            new_ids = await self._vector_store.async_add(nodes_batch, **insert_kwargs)\n",
    "\n",
    "            # if the vector store doesn't store text, we need to add the nodes to the\n",
    "            # index struct and document store\n",
    "            if not self._vector_store.stores_text or self._store_nodes_override:\n",
    "                for node, new_id in zip(nodes_batch, new_ids):\n",
    "                    # NOTE: remove embedding from node to avoid duplication\n",
    "                    node_without_embedding = node.model_copy()\n",
    "                    node_without_embedding.embedding = None\n",
    "\n",
    "                    index_struct.add_node(node_without_embedding, text_id=new_id)\n",
    "                    await self._docstore.async_add_documents(\n",
    "                        [node_without_embedding], allow_update=True\n",
    "                    )\n",
    "            else:\n",
    "                # NOTE: if the vector store keeps text,\n",
    "                # we only need to add image and index nodes\n",
    "                for node, new_id in zip(nodes_batch, new_ids):\n",
    "                    if isinstance(node, (ImageNode, IndexNode)):\n",
    "                        # NOTE: remove embedding from node to avoid duplication\n",
    "                        node_without_embedding = node.model_copy()\n",
    "                        node_without_embedding.embedding = None\n",
    "\n",
    "                        index_struct.add_node(node_without_embedding, text_id=new_id)\n",
    "                        await self._docstore.async_add_documents(\n",
    "                            [node_without_embedding], allow_update=True\n",
    "                        )\n",
    "\n",
    "    def _add_nodes_to_index(\n",
    "        self,\n",
    "        index_struct: IndexDict,\n",
    "        nodes: Sequence[BaseNode],\n",
    "        show_progress: bool = False,\n",
    "        **insert_kwargs: Any,\n",
    "    ) -> None:\n",
    "        \"\"\"Add document to index.\"\"\"\n",
    "        if not nodes:\n",
    "            return\n",
    "\n",
    "        for nodes_batch in iter_batch(nodes, self._insert_batch_size):\n",
    "            nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)\n",
    "            new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)\n",
    "\n",
    "            if not self._vector_store.stores_text or self._store_nodes_override:\n",
    "                # NOTE: if the vector store doesn't store text,\n",
    "                # we need to add the nodes to the index struct and document store\n",
    "                for node, new_id in zip(nodes_batch, new_ids):\n",
    "                    # NOTE: remove embedding from node to avoid duplication\n",
    "                    node_without_embedding = node.model_copy()\n",
    "                    node_without_embedding.embedding = None\n",
    "\n",
    "                    index_struct.add_node(node_without_embedding, text_id=new_id)\n",
    "                    self._docstore.add_documents(\n",
    "                        [node_without_embedding], allow_update=True\n",
    "                    )\n",
    "            else:\n",
    "                # NOTE: if the vector store keeps text,\n",
    "                # we only need to add image and index nodes\n",
    "                for node, new_id in zip(nodes_batch, new_ids):\n",
    "                    if isinstance(node, (ImageNode, IndexNode)):\n",
    "                        # NOTE: remove embedding from node to avoid duplication\n",
    "                        node_without_embedding = node.model_copy()\n",
    "                        node_without_embedding.embedding = None\n",
    "\n",
    "                        index_struct.add_node(node_without_embedding, text_id=new_id)\n",
    "                        self._docstore.add_documents(\n",
    "                            [node_without_embedding], allow_update=True\n",
    "                        )\n",
    "\n",
    "    def _build_index_from_nodes(\n",
    "        self,\n",
    "        nodes: Sequence[BaseNode],\n",
    "        **insert_kwargs: Any,\n",
    "    ) -> IndexDict:\n",
    "        \"\"\"Build index from nodes.\"\"\"\n",
    "        index_struct = self.index_struct_cls()\n",
    "        if self._use_async:\n",
    "            tasks = [\n",
    "                self._async_add_nodes_to_index(\n",
    "                    index_struct,\n",
    "                    nodes,\n",
    "                    show_progress=self._show_progress,\n",
    "                    **insert_kwargs,\n",
    "                )\n",
    "            ]\n",
    "            run_async_tasks(tasks)\n",
    "        else:\n",
    "            self._add_nodes_to_index(\n",
    "                index_struct,\n",
    "                nodes,\n",
    "                show_progress=self._show_progress,\n",
    "                **insert_kwargs,\n",
    "            )\n",
    "        return index_struct\n",
    "\n",
    "    def build_index_from_nodes(\n",
    "        self,\n",
    "        nodes: Sequence[BaseNode],\n",
    "        **insert_kwargs: Any,\n",
    "    ) -> IndexDict:\n",
    "        \"\"\"\n",
    "        Build the index from nodes.\n",
    "\n",
    "        NOTE: Overrides BaseIndex.build_index_from_nodes.\n",
    "        VectorStoreIndex only stores nodes in document store\n",
    "        if vector store does not store text\n",
    "        \"\"\"\n",
    "        # Filter out the nodes that don't have content\n",
    "        content_nodes = [\n",
    "            node\n",
    "            for node in nodes\n",
    "            if node.get_content(metadata_mode=MetadataMode.EMBED) != \"\"\n",
    "        ]\n",
    "\n",
    "        # Report if some nodes are missing content\n",
    "        if len(content_nodes) != len(nodes):\n",
    "            print(\"Some nodes are missing content, skipping them...\")\n",
    "\n",
    "        return self._build_index_from_nodes(content_nodes, **insert_kwargs)\n",
    "\n",
    "    def _insert(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:\n",
    "        \"\"\"Insert a document.\"\"\"\n",
    "        self._add_nodes_to_index(self._index_struct, nodes, **insert_kwargs)\n",
    "\n",
    "    def _validate_serializable(self, nodes: Sequence[BaseNode]) -> None:\n",
    "        \"\"\"Validate that the nodes are serializable.\"\"\"\n",
    "        for node in nodes:\n",
    "            if isinstance(node, IndexNode):\n",
    "                try:\n",
    "                    node.dict()\n",
    "                except ValueError:\n",
    "                    self._object_map[node.index_id] = node.obj\n",
    "                    node.obj = None\n",
    "\n",
    "    async def ainsert_nodes(\n",
    "        self, nodes: Sequence[BaseNode], **insert_kwargs: Any\n",
    "    ) -> None:\n",
    "        \"\"\"\n",
    "        Insert nodes.\n",
    "\n",
    "        NOTE: overrides BaseIndex.ainsert_nodes.\n",
    "        VectorStoreIndex only stores nodes in document store\n",
    "        if vector store does not store text\n",
    "        \"\"\"\n",
    "        self._validate_serializable(nodes)\n",
    "\n",
    "        with self._callback_manager.as_trace(\"insert_nodes\"):\n",
    "            await self._async_add_nodes_to_index(\n",
    "                self._index_struct, nodes, **insert_kwargs\n",
    "            )\n",
    "            self._storage_context.index_store.add_index_struct(self._index_struct)\n",
    "\n",
    "    def insert_nodes(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:\n",
    "        \"\"\"\n",
    "        Insert nodes.\n",
    "\n",
    "        NOTE: overrides BaseIndex.insert_nodes.\n",
    "        VectorStoreIndex only stores nodes in document store\n",
    "        if vector store does not store text\n",
    "        \"\"\"\n",
    "        self._validate_serializable(nodes)\n",
    "\n",
    "        with self._callback_manager.as_trace(\"insert_nodes\"):\n",
    "            self._insert(nodes, **insert_kwargs)\n",
    "            self._storage_context.index_store.add_index_struct(self._index_struct)\n",
    "\n",
    "    def _delete_node(self, node_id: str, **delete_kwargs: Any) -> None:\n",
    "        pass\n",
    "\n",
    "    async def adelete_nodes(\n",
    "        self,\n",
    "        node_ids: List[str],\n",
    "        delete_from_docstore: bool = False,\n",
    "        **delete_kwargs: Any,\n",
    "    ) -> None:\n",
    "        \"\"\"\n",
    "        Delete a list of nodes from the index.\n",
    "\n",
    "        Args:\n",
    "            node_ids (List[str]): A list of node_ids from the nodes to delete\n",
    "\n",
    "        \"\"\"\n",
    "        # delete nodes from vector store\n",
    "        await self._vector_store.adelete_nodes(node_ids, **delete_kwargs)\n",
    "\n",
    "        # delete from docstore only if needed\n",
    "        if (\n",
    "            not self._vector_store.stores_text or self._store_nodes_override\n",
    "        ) and delete_from_docstore:\n",
    "            for node_id in node_ids:\n",
    "                self._index_struct.delete(node_id)\n",
    "                await self._docstore.adelete_document(node_id, raise_error=False)\n",
    "        self._storage_context.index_store.add_index_struct(self._index_struct)\n",
    "\n",
    "    def delete_nodes(\n",
    "        self,\n",
    "        node_ids: List[str],\n",
    "        delete_from_docstore: bool = False,\n",
    "        **delete_kwargs: Any,\n",
    "    ) -> None:\n",
    "        \"\"\"\n",
    "        Delete a list of nodes from the index.\n",
    "\n",
    "        Args:\n",
    "            node_ids (List[str]): A list of node_ids from the nodes to delete\n",
    "\n",
    "        \"\"\"\n",
    "        # delete nodes from vector store\n",
    "        self._vector_store.delete_nodes(node_ids, **delete_kwargs)\n",
    "\n",
    "        # delete from docstore only if needed\n",
    "        if (\n",
    "            not self._vector_store.stores_text or self._store_nodes_override\n",
    "        ) and delete_from_docstore:\n",
    "            for node_id in node_ids:\n",
    "                self._index_struct.delete(node_id)\n",
    "                self._docstore.delete_document(node_id, raise_error=False)\n",
    "        self._storage_context.index_store.add_index_struct(self._index_struct)\n",
    "\n",
    "    def _delete_from_index_struct(self, ref_doc_id: str) -> None:\n",
    "        # delete from index_struct only if needed\n",
    "        if not self._vector_store.stores_text or self._store_nodes_override:\n",
    "            ref_doc_info = self._docstore.get_ref_doc_info(ref_doc_id)\n",
    "            if ref_doc_info is not None:\n",
    "                for node_id in ref_doc_info.node_ids:\n",
    "                    self._index_struct.delete(node_id)\n",
    "                    self._vector_store.delete(node_id)\n",
    "\n",
    "    def _delete_from_docstore(self, ref_doc_id: str) -> None:\n",
    "        # delete from docstore only if needed\n",
    "        if not self._vector_store.stores_text or self._store_nodes_override:\n",
    "            self._docstore.delete_ref_doc(ref_doc_id, raise_error=False)\n",
    "\n",
    "    def delete_ref_doc(\n",
    "        self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any\n",
    "    ) -> None:\n",
    "        \"\"\"Delete a document and it's nodes by using ref_doc_id.\"\"\"\n",
    "        self._vector_store.delete(ref_doc_id, **delete_kwargs)\n",
    "        self._delete_from_index_struct(ref_doc_id)\n",
    "        if delete_from_docstore:\n",
    "            self._delete_from_docstore(ref_doc_id)\n",
    "        self._storage_context.index_store.add_index_struct(self._index_struct)\n",
    "\n",
    "    async def _adelete_from_index_struct(self, ref_doc_id: str) -> None:\n",
    "        \"\"\"Delete from index_struct only if needed.\"\"\"\n",
    "        if not self._vector_store.stores_text or self._store_nodes_override:\n",
    "            ref_doc_info = await self._docstore.aget_ref_doc_info(ref_doc_id)\n",
    "            if ref_doc_info is not None:\n",
    "                for node_id in ref_doc_info.node_ids:\n",
    "                    self._index_struct.delete(node_id)\n",
    "                    self._vector_store.delete(node_id)\n",
    "\n",
    "    async def _adelete_from_docstore(self, ref_doc_id: str) -> None:\n",
    "        \"\"\"Delete from docstore only if needed.\"\"\"\n",
    "        if not self._vector_store.stores_text or self._store_nodes_override:\n",
    "            await self._docstore.adelete_ref_doc(ref_doc_id, raise_error=False)\n",
    "\n",
    "    async def adelete_ref_doc(\n",
    "        self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any\n",
    "    ) -> None:\n",
    "        \"\"\"Delete a document and it's nodes by using ref_doc_id.\"\"\"\n",
    "        tasks = [\n",
    "            self._vector_store.adelete(ref_doc_id, **delete_kwargs),\n",
    "            self._adelete_from_index_struct(ref_doc_id),\n",
    "        ]\n",
    "        if delete_from_docstore:\n",
    "            tasks.append(self._adelete_from_docstore(ref_doc_id))\n",
    "\n",
    "        await asyncio.gather(*tasks)\n",
    "\n",
    "        self._storage_context.index_store.add_index_struct(self._index_struct)\n",
    "\n",
    "    @property\n",
    "    def ref_doc_info(self) -> Dict[str, RefDocInfo]:\n",
    "        \"\"\"Retrieve a dict mapping of ingested documents and their nodes+metadata.\"\"\"\n",
    "        if not self._vector_store.stores_text or self._store_nodes_override:\n",
    "            node_doc_ids = list(self.index_struct.nodes_dict.values())\n",
    "            nodes = self.docstore.get_nodes(node_doc_ids)\n",
    "\n",
    "            all_ref_doc_info = {}\n",
    "            for node in nodes:\n",
    "                ref_node = node.source_node\n",
    "                if not ref_node:\n",
    "                    continue\n",
    "\n",
    "                ref_doc_info = self.docstore.get_ref_doc_info(ref_node.node_id)\n",
    "                if not ref_doc_info:\n",
    "                    continue\n",
    "\n",
    "                all_ref_doc_info[ref_node.node_id] = ref_doc_info\n",
    "            return all_ref_doc_info\n",
    "        else:\n",
    "            raise NotImplementedError(\n",
    "                \"Vector store integrations that store text in the vector store are \"\n",
    "                \"not supported by ref_doc_info yet.\"\n",
    "            )\n",
    "\n"
   ]
  }
 ],
 "source": [
  "import inspect\n",
  "print(inspect.getsource(llama_index.core.VectorStoreIndex))"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "id": "8125e2de",
 "metadata": {},
 "outputs": [],
 "source": []
}
],
"metadata": {
 "kernelspec": {
  "display_name": ".venv",
  "language": "python",
  "name": "python3"
 },
 "language_info": {
  "codemirror_mode": {
   "name": "ipython",
   "version": 3
  },
  "file_extension": ".py",
  "mimetype": "text/x-python",
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
  "version": "3.12.11"
 }
},
"nbformat": 4,
"nbformat_minor": 5
}
52
saved_output/2025_08_24.txt
Normal file
@@ -0,0 +1,52 @@
Enter a search topic or question (or 'exit'):
The mind as a terrible master.

**Summary Theme:**
This collection of excerpts explores the human mind's complex nature, its cognitive processes, perception, memory, and social
interactions. The texts delve into how our thoughts are shaped by external stimuli, our brain's organizational patterns, and the
emergence of consciousness. Additionally, they touch on the mind's tendency to create illusions (like "mass delusion") and the
challenges posed by a distributed brain and decentralized consciousness.

**Matching Files:**

1. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-01-23.txt - The excerpt examines what becomes of an
incessant critic in the mind, questioning the role of brain functions in creating negative self-perceptions.
2. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-03-10.txt - The excerpt delves into the journal's
effect on cognition, implying that it can profoundly influence thought processes even if not actively processing itself.
3. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-08-18.txt - This excerpt suggests a comparison between
a mind that is constantly on the move and an overloaded train, emphasizing the lack of central control.
4. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-03-08.txt - The excerpt explores thoughts as
"tendrils" that can hold a person and influence their mood and behavior, mirroring the mind's struggle with self-awareness.
5. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2019-07-03.txt - This excerpt connects the idea of
"impression management" to evolutionary adaptation, suggesting a mind capable of organizing individuals around "fictions."
6. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-03-24.txt - The excerpt emphasizes the modularity of
the mind and preconditioned responses, reflecting on E. Bruce Goldstein's book and its relevance to understanding consciousness.
7. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-01-06.txt - A contemplation of memory's role as a
"lump of coal that bears the delicate impression of a leaf," highlighting its complexity and non-linear nature.
8. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-05-04.txt - The excerpt reflects on feelings of
skepticism about human endeavors and the mind's tendency to imagine conflict, leading to negative self-judgment.
9. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-01-23.txt - The excerpt references David Foster
Wallace's quote about the mind as a "terrible master" and explores how individuals who commit suicide often shoot themselves in the
head, silencing their inner critic.
10. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-03-08.txt - This excerpt discusses the philosophical
implications of a distributed brain and decentralized consciousness, questioning the existence of a singular "self" making
decisions.


Source documents:

2009-08-24.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2009-08-24.txt 0.7286187534682154
2023-01-23.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-01-23.txt 0.7174705749042735
2023-03-09.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-03-09.txt 0.6905817844827031
2023-01-06.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-01-06.txt 0.6872058770669452
2021-05-04.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-05-04.txt 0.6866138676376796
2022-04-17.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2022-04-17.txt 0.6837786406828062
2025-03-10.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-03-10.txt 0.6825293816922051
2021-05-02.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-05-02.txt 0.6818701242339038
2025-03-08.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-03-08.txt 0.6804468955664654
2024-03-24.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-03-24.txt 0.6798798323221176
2022-02-24.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2022-02-24.txt 0.6779782723066287
2024-08-18.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-08-18.txt 0.676507830756482
2019-07-03.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2019-07-03.txt 0.6754137298987061
2021-12-22.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-12-22.txt 0.6747843533262554
2024-03-24.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-03-24.txt 0.6740836055290546
65
saved_output/2025_08_27.txt
Normal file
@@ -0,0 +1,65 @@
Enter your query (or type 'exit' to quit): It's a weird refuge to refute nationality; to claim that all is a fraud, anyway. Still, it is the most sane reaction right now. The right to walk away. All else is slavery.
|
||||||
|
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/docstore.json.
|
||||||
|
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/index_store.json.
|
||||||
|
|
||||||
|
Response:
|
||||||
|
|
||||||
|
**Summary Theme:**
|
||||||
|
The texts explore complex issues related to freedom, equality, and justice, particularly within political and social contexts.
|
||||||
|
They discuss the limitations of current systems in addressing human rights and ethical standards, including prison treatment,
|
||||||
|
racial discrimination, and the expansion of citizenship rights. The texts also delve into the nature of revolutions, questioning
|
||||||
|
whether they benefit a select elite or are instigated by them.

**Matching Files:**

1. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2002-05-09.txt - This text discusses the treatment of prisoners, highlighting a lack of adherence to constitutional standards and due process, as well as addressing issues related to nationality and citizenship.
2. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-06-04.txt - A blog post discussing the idea of 'human nature' in politics, with a focus on the importance of freedom and the critique of a political figure's policies.
3. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-10-13.txt - This passage explores the topic of white male privilege and its implications in business, questioning why it is highlighted while overlooking broader systemic issues.
4. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-01-03.txt - The text delves into the idea of freedom and imagination in the early 20th century, comparing it to the present day and discussing the constraints of democracy and capitalism.
5. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2016-11-09.txt - A text addressing civil rights history and the hate and anger that arise from it, questioning how to respond while avoiding tribalism and degeneration.
6. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-28.txt - Discussing legal issues and the challenges of thinking about human rights in a multicultural context, particularly regarding people who are neither Christians nor infidels.
7. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-01-29.txt - A personal reflection on current political events and the author's difficulties in expressing feelings, tying into discussions of history and psychology.
8. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2001-09-13.txt - This text addresses the justification for violent acts during a perceived war against fundamentalism, emphasizing economic and moral principles over nationhood.
9. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-06-03.txt - A passage on the concept of incommensurability in politics and how consensus processes aim to reconcile different perspectives in practical situations.
10. file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2002-05-09.txt - Discussing the expansion of citizenship rights in American history, highlighting key dates and milestones in this process.

Source documents:
2002-05-09.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2002-05-09.txt 0.6722719323340611
2025-06-04.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-06-04.txt 0.6608581763116415
2024-07-14.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-07-14.txt 0.6475284193414396
2025-06-04.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-06-04.txt 0.6468059334833061
2025-06-03.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-06-03.txt 0.6466041920182646
2016-11-09.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2016-11-09.txt 0.6451955555687188
2001-09-13.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2001-09-13.txt 0.6433104875230174
2025-01-04.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-01-04.txt 0.6356563682194852
2024-10-13.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-10-13.txt 0.6347407640363988
2025-01-03.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-01-03.txt 0.6336626187333729
2021-09-16.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-09-16.txt 0.6328042502815873
2025-07-28.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-28.txt 0.6324342333276086
2025-01-04.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-01-04.txt 0.6317671258192576
2025-01-29.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-01-29.txt 0.6313280704571994
2024-06-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-06-20.txt 0.629663790289146

Query processed in 95 seconds.

saved_output/2025_08_28.txt (new file, 174 lines)
@@ -0,0 +1,174 @@

Enter your query (or type 'exit' to quit): I'm looking for the happiest and most joyful passages.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/index_store.json.

Response:

**Summary Theme:**
The author reflects on moments of joy and happiness in their life, exploring themes such as contentment, love, and the beauty of everyday experiences. They express a desire to let themselves be happy every day and find pleasure in creative pursuits like poetry and art appreciation. Despite personal struggles with depression and anxiety, the author emphasizes the importance of finding happiness in one's daily life.

**Matching Files:**

1. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-12-03.txt** — Chloe's smile while praising her piano playing brings joy, highlighting the author's appreciation for small acts of kindness.
2. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2006-08-11.txt** — The author's day feeding carrots to horses and making pizza with Matthew is described as "fun times, maybe the best ever," showcasing their ability to find joy in simple pleasures.
3. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-01-23.txt** — Reading poetry and appreciating simple observations brings positive thoughts, indicating a focus on finding happiness through creative pursuits.
4. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-07-25.txt** — The author reflects on the joys of their life, such as time with family and the love they experienced with T, despite later experiencing heartache.
5. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2013-02-15.txt** — The passage encourages being joyful, happy, pleased, and glad, aligning with the author's overall theme of finding happiness in various life experiences.
6. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2019-09-14.txt** — Reflecting on a week of learning, teaching, and feeling curious leads to the realization that one can find happiness every day, emphasizing the author's ability to let themselves be happy.
7. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2012-09-20.txt** — This file, titled "Ευδαιμονía," contains ancient Greek words related to happiness and well-being, further reinforcing the author's exploration of finding joy in life.
8. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-29.txt** — The scorching weather and being outside provide a backdrop to the author's ability to find happiness despite potential physical discomfort, demonstrating their resilient outlook.
9. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-10-09.txt** — The author expresses frustration and depression due to daily interactions but also acknowledges the importance of finding happiness in life, aligning with their broader theme.
10. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-12-06.txt** — The passage defines happiness as contentment and peacefulness, highlighting the author's pursuit of a joyful life through their experiences.

Source documents:
2025-07-29.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-29.txt 0.7135682886000794
2008-12-06.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-12-06.txt 0.7099131243276414
2009-06-04.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2009-06-04.txt 0.6973211899243362
2025-08-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt 0.6866097119060084
2013-02-15.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2013-02-15.txt 0.686259123672228
2012-09-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2012-09-20.txt 0.6790148415972938
2015-01-23.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-01-23.txt 0.6761073066656899
2015-12-03.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-12-03.txt 0.6712531329880593
2006-08-11.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2006-08-11.txt 0.6613670040827223
2024-07-25.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-07-25.txt 0.6570111677987235
2025-08-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt 0.6558116128405127
2019-09-14.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2019-09-14.txt 0.6549423349658567
2024-04-03.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-04-03.txt 0.6546862471469852
2023-07-24.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-07-24.txt 0.6544076938168284
2025-08-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt 0.6541587448214657

Query processed in 73 seconds.

---

This was a strange failure!

((.venv) ) ~/Library/CloudStorage/Dropbox/nd/ssearch/$ run_query.sh
Enter your query (or type 'exit' to quit): Find documents that express feelings of gratitude.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/index_store.json.

Response:

**Summary Theme:**
The query is about finding documents expressing feelings of gratitude. However, it seems there was an error in my interpretation or the context provided, as the dominant themes I identified earlier were related to depression and anxiety rather than gratitude. Based on the given context, the theme that matches the query is related to personal struggles with mental health, particularly feelings of sadness and appreciation for connections.

**Matching Files:**

1. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-10-09.txt** — Expressed frustration with joggers on the bike path but did not mention gratitude.
2. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt** — No direct expressions of gratitude found, but a reflection on personal struggles and achievements was present.
3. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-05-27.txt** — Focuses on negative emotions like anxiety and anger, with no clear expressions of gratitude.
4. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2013-05-23.txt** — Mentions the joy of helping others achieve their goals, which could be interpreted as a form of appreciation or gratitude for their success and recognition.
5. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-03-17.txt** — Contains suicidal thoughts and negative feelings, indicating a lack of gratitude.
6. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-07-16.txt** — Describes feelings of loss and the search for meaning, devoid of expressions of gratitude.
7. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-03-18.txt** — No clear mentions of gratitude found.
8. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2007-05-31.txt** — Focuses on career concerns and negative emotions, without expressing gratitude.
9. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2020-05-13.txt** — Struggles with recognizing others' efforts due to internal bad feelings, which contrasts the idea of gratitude.
10. **/Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2020-02-01.txt** — Mentions reconnecting with old friendships and family, but there are no explicit expressions of gratitude.

Source documents:
2025-08-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt 0.6865291287082457
2008-05-27.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-05-27.txt 0.6707430757786356
2023-02-17.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2023-02-17.txt 0.6624994985797085
2025-08-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt 0.6614406157945066
2025-03-18.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-03-18.txt 0.6589271548285772
2025-08-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt 0.6583888795181797
2025-07-28.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-28.txt 0.6575634356770015
2012-09-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2012-09-20.txt 0.6564913212073614
2020-05-13.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2020-05-13.txt 0.6563809376620068
2025-08-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-08-20.txt 0.6549296468531686
2013-05-23.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2013-05-23.txt 0.653871795081564
2009-06-04.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2009-06-04.txt 0.6535844277567499
2007-05-31.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2007-05-31.txt 0.6524713123412845
2025-07-29.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-29.txt 0.6517446358739963
2020-02-01.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2020-02-01.txt 0.6514433384900066

Query processed in 74 seconds.

---

I implemented a regex that strips the full path:
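A minimal sketch of what that stripping step might look like; the helper name and the exact pattern are assumptions, not the actual implementation:

```python
import re

# Hypothetical sketch: collapse the absolute project prefix in any
# ".../ssearch/data/<file>.txt" path down to a relative "./data/..." path,
# so the retrieved context handed to the LLM carries short, uniform paths.
_ABS_PREFIX = re.compile(r"/Users/\S*?/ssearch/(?=data/)")

def strip_full_paths(text: str) -> str:
    """Replace absolute journal-file paths with relative ./data/... paths."""
    return _ABS_PREFIX.sub("./", text)
```

The lookahead keeps `data/` out of the match, so only the machine-specific prefix is rewritten.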

((.venv) ) ~/Library/CloudStorage/Dropbox/nd/ssearch/$ run_query.sh
Enter your query (or type 'exit' to quit): Entries that discuss testing one's limits, especially emotional and mental.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/index_store.json.

Response:

**Summary Theme:**
The dominant theme in this context appears to be an individual exploring their emotions, particularly their mental and emotional boundaries, as well as the impact of societal perceptions on feelings. The writer grapples with anxiety, depression, self-worth, and the fear of inadequacy while also contemplating their own mortality and purpose. They seek to understand and manage their emotions, often viewing them as data or information that can guide survival and informed decision-making.

**Matching Files:**

1. **file_path: ./data/2023-07-16.txt** — Describes the struggle of wrestling with depression for years, emphasizing the search for meaning in a world driven by efficiency and optimization.
2. **file_path: ./data/2015-03-17.txt** — Mentions suicidal thoughts and feeling overwhelmed by negative emotions, indicating a desire to test one's limits emotionally.
3. **file_path: ./data/2019-01-14.txt** — Discusses the struggle with controlling impulses and feelings of stress, anxiety, and depression while questioning if one is a prisoner of their biology.
4. **file_path: ./data/2025-06-17.txt** — Explores the concept of feeling out personal boundaries and accepting dissonance, which could be seen as testing emotional limits.
5. **file_path: ./data/2025-08-20.txt** — Mentions the interest in anarchy while being invested in capital markets and holding a tenured position, indicating a potential exploration of one's limits.
6. **file_path: ./data/2017-12-06.txt** — Expresses suicidal thoughts due to burnout and emotional exhaustion, suggesting an attempt to test personal boundaries.
7. **file_path: ./data/2017-12-16.txt** — Explores the desire to be a better person and the struggle with balance, potentially indicating a journey of testing one's limits.
8. **file_path: ./data/2017-04-13.txt** — Focuses on worrying about hypotheticals and imagined fights, suggesting an exploration of personal boundaries and emotional limits.
9. **file_path: ./data/2024-09-20.txt** — Admitted to having depressive thoughts despite appearing jovial, indicating a discussion on testing the limits of one's mental health.
10. **file_path: ./data/2025-08-20.txt** — The computer facilitates artistic innovation by freeing the artist from conventional "mental ready-mades," enabling the production of new assemblages of shapes and colors.

Source documents:
2019-01-28.txt ./data/2019-01-28.txt 0.7091032318236316
2003-03-09.txt ./data/2003-03-09.txt 0.6819464422399241
2025-08-20.txt ./data/2025-08-20.txt 0.6796124657599102
2025-08-20.txt ./data/2025-08-20.txt 0.6785008440538487
2017-04-13.txt ./data/2017-04-13.txt 0.6768340197245936
2022-05-06.txt ./data/2022-05-06.txt 0.6750801120630013
2023-01-27.txt ./data/2023-01-27.txt 0.6703347559624786
2023-03-14.txt ./data/2023-03-14.txt 0.668340287632692
2025-06-17.txt ./data/2025-06-17.txt 0.6656929175939117
2025-08-20.txt ./data/2025-08-20.txt 0.6645024849162311
2023-07-16.txt ./data/2023-07-16.txt 0.6618312766890652
2021-04-15.txt ./data/2021-04-15.txt 0.661171288633267
2025-08-20.txt ./data/2025-08-20.txt 0.6600615010925119
2019-01-14.txt ./data/2019-01-14.txt 0.6563840810491259
2025-05-23.txt ./data/2025-05-23.txt 0.6561484407217757

Query processed in 79 seconds.

saved_output/2025_08_30.txt (new file, 124 lines)
@@ -0,0 +1,124 @@

((.venv) ) ~/Library/CloudStorage/Dropbox/nd/ssearch/$ run_query.sh
Enter your query (or type 'exit' to quit): You are a machine and you can program yourself.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/index_store.json.

Response:

### Summary Theme
The dominant theme from the provided context revolves around the concept of human agency and the transformative power of self-programming, often referred to as "YOU ARE A MACHINE AND YOU CAN PROGRAM YOURSELF". This phrase encapsulates the idea that individuals have the capability to upgrade their own mental and emotional states through conscious effort and learning. The context explores this theme through various angles, including philosophical reflections on storytelling and human existence, discussions of meditation and personal growth, and explorations of information replication and self-replication systems like DNA and memes.

### Matching Files
1. **file_path: ./data/2024-09-21.txt** - The passage discusses Von Neumann's theory of self-replication, which ties into the idea that individuals can replicate and "program" themselves through learning and development.
2. **file_path: ./data/2024-08-19.txt** - This snippet features a series of repetitions of "YOU ARE A MACHINE AND YOU CAN PROGRAM YOURSELF," emphasizing the transformative power of self-programming and personal growth.
3. **file_path: ./data/2024-03-11.txt** - The author contemplates purpose, meaning, and the act of programming oneself through writing and reflection, aligning with the self-programming theme.
4. **file_path: ./data/2023-07-16.txt** - A robot's inner struggles, including a desire to explore feelings and understand its own existence, hint at the idea of self-programming and personal development.
5. **file_path: ./data/2024-04-11.txt** - The author's fascination with the materiality of computation suggests a connection to understanding human existence through self-programming and self-replication.
6. **file_path: ./data/2024-02-13.txt** - The extensive list of tasks and projects the author wants to accomplish reflects a drive for personal growth, akin to self-programming.
7. **file_path: ./data/2024-08-19.txt** - A discussion of information degradation and the relationship between DNA, memes, and information replication hints at a deeper understanding through self-programming.
8. **file_path: ./data/2025-01-24.txt** - A detailed account of typing a program onto IBM cards and working with early computers emphasizes the labor involved in creating and learning from technology, which can be seen as self-programming through the acquisition of knowledge.
9. **file_path: ./data/2025-02-05.txt** - This passage raises questions about truth and human agency, which are closely tied to the idea of self-programming and the ability to shape one's own existence through conscious effort.
10. **file_path: ./data/2022-01-22.txt** - The author relates meditation to "operating system updates," illustrating how self-programming can lead to improved performance and functionality in the mind, much like a computer's software.

Source documents:
2021-02-12.txt ./data/2021-02-12.txt 0.7813853524305151
2021-03-12.txt ./data/2021-03-12.txt 0.7170262020422805
2021-03-22.txt ./data/2021-03-22.txt 0.7080438590471859
2025-02-05.txt ./data/2025-02-05.txt 0.700772619041579
2022-01-22.txt ./data/2022-01-22.txt 0.6946526808142116
2024-08-19.txt ./data/2024-08-19.txt 0.6909295863339957
2024-09-21.txt ./data/2024-09-21.txt 0.6863798746276172
2024-08-19.txt ./data/2024-08-19.txt 0.6811521050296564
2024-03-11.txt ./data/2024-03-11.txt 0.6776553751255855
2023-07-16.txt ./data/2023-07-16.txt 0.6734772841938028
2021-03-01.txt ./data/2021-03-01.txt 0.6703476982962236
2025-05-26.txt ./data/2025-05-26.txt 0.6699061717036373
2024-02-13.txt ./data/2024-02-13.txt 0.6675189407579228
2025-01-24.txt ./data/2025-01-24.txt 0.6661259191158485
2024-04-11.txt ./data/2024-04-11.txt 0.664854046786588

Query processed in 97 seconds.

Enter your query (or type 'exit' to quit): Summarize passages related to questions about truth and human agency.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./storage_exp/index_store.json.

Response:

**Summary Theme:**
The texts explore the relationship between truth, knowledge, and human agency, highlighting how our understanding of reality is shaped by interpretation and negotiation rather than an objective standard. They question the nature of self-awareness and consciousness, suggesting that it arises from independent facts and truths beyond individual control. This perspective challenges traditional notions of knowledge and ethics, suggesting that shared meaning and identity might be more influential than facts themselves. The theme also delves into the implications of these ideas for governance, society, and the integration of diverse epistemic frameworks.

**Matching Files:**

1. ./data/2025-03-08.txt - The passage emphasizes that truth is not fixed but a product of human interpretation, challenging the idea of absolute knowledge and suggesting that our beliefs are subject to revision and refinement as we learn more about ourselves and the world.
2. ./data/2025-02-14.txt - This snippet discusses truth as a social construct, prompting questions about ethical epistemology and the potential role of AI in shaping epistemic environments. It also introduces the idea that knowledge is a product of shared meaning and identity rather than just facts.
3. ./data/2025-02-14.txt - The text includes a quote suggesting that our concepts are self-created, which can both empower and limit us, fostering critical thinking and personal responsibility while also potentially leading to disorientation and existential uncertainty.
4. ./data/2025-03-08.txt - The implications of truth being a social construct are explored further, including the idea that fact-checking alone doesn't address shared meaning and identity, leading to discussions about ethics, society, governance, and the role of AI in shaping epistemic environments.
5. ./data/2025-03-08.txt - This file delves into the concept of "we are supplicants to our own fiction," exploring how humans create meaning systems that can be comforting but potentially misleading or limiting, and emphasizing the importance of self-awareness for critical thinking and personal growth.
6. ./data/2025-02-14.txt - The text raises questions about the nature of truth, reality, and human agency, inviting contemplation on whether our stories are mere constructs or reflections of deeper aspects of human existence, and how we can navigate storytelling to uncover accurate portrayals.
7. ./data/2025-03-08.txt - This passage continues the discussion on the implications of a social construct of truth, including the potential role of AI in mediating competing epistemic frameworks and reducing polarization.
8. ./data/2010-12-30.txt - A statement that "there is no meaning and no purpose in life" is discussed, reflecting existentialist philosophies, and raising questions about reconciling the tension between fiction and the search for meaning and purpose.
9. ./data/2025-03-08.txt - This snippet presents a list of resources related to knowledge, power, and institutions, including Michel Foucault, Donna Haraway, Stefan Lorenz Sorgner, William James, and Noam Chomsky, reflecting on the relationship between power and knowledge throughout history.

Source documents:
2006-12-27.txt ./data/2006-12-27.txt 0.7103349341421636
2025-02-14.txt ./data/2025-02-14.txt 0.6992770918721224
2025-03-08.txt ./data/2025-03-08.txt 0.686001774445945
2025-02-14.txt ./data/2025-02-14.txt 0.6743349162123844
2025-03-08.txt ./data/2025-03-08.txt 0.6733934128354977
2025-03-08.txt ./data/2025-03-08.txt 0.6706689033144045
2025-02-05.txt ./data/2025-02-05.txt 0.6702486733668184
2025-01-04.txt ./data/2025-01-04.txt 0.6699433363201491
2025-02-14.txt ./data/2025-02-14.txt 0.6691576672622886
2025-03-08.txt ./data/2025-03-08.txt 0.6670311145975771
2008-04-22.txt ./data/2008-04-22.txt 0.665624998848253
2025-02-06.txt ./data/2025-02-06.txt 0.6654464518589284
2010-12-30.txt ./data/2010-12-30.txt 0.663147445474458
2004-02-15.txt ./data/2004-02-15.txt 0.6625948924633361
2025-02-06.txt ./data/2025-02-06.txt 0.6608789240071589

Query processed in 92 seconds.

saved_output/README.txt (new file, 1 line)
@@ -0,0 +1 @@

This directory contains collections of interesting output from the nd_ssearch query engine.

saved_output/compare command-r7b.md (new file, 113 lines)
@@ -0,0 +1,113 @@

# Generated by llama3.1:8B

Enter a search topic or question (or 'exit'):
Simplicity, peace, and acceptance.

**Summary Theme**

The dominant theme that emerges from the provided context is the pursuit of simplicity, peace, and acceptance as a means to find meaning and contentment in life. The excerpts suggest that individuals often struggle with existential crises, anxiety, and dissatisfaction, but through various philosophical and spiritual practices, they seek to cultivate a sense of inner peace and harmony.

**Matching Files**

1. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-09-22.txt** — Suzuki discusses the importance of letting go of intellectual pursuits and embracing the simplicity of life, feeling the "power of life" and being content with its evolution.
2. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-02-12.txt** — The author reflects on their emptiness and yearning for something deeper, wondering if they can relish in this feeling and explore it further.
3. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-02-27.txt** — Life is described as pain, but the author finds solace in God and feels a deep connection to something universal, seeking to do good for others.
4. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2012-09-08.txt** — The text advises living life with virtue above pleasure and tranquility above happiness, finding contentment in the present moment.
5. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2007-05-30.txt** — The author reflects on their existential crisis and seeks to find meaning through meditation, exploring the wonder of life and the universe.
6. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-02-05.txt** — Alan Watts' measured pace and calm tone are mentioned as soothing, inviting listeners to slow down and absorb his philosophical musings.
7. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt** — The author experiences a sense of dissolution of the self and acceptance of mortality through music, finding hope in the peaceful and calm aspects of human existence.
8. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2011-04-20.txt** — The text presents two contrasting choices for raising children: to learn how to fight or navigate complex social interactions, leading to exhaustion and missteps.
9. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt** — Music is mentioned as a means to connect with those who have passed and accept one's own mortality, finding hope in acceptance.
10. **file_path: /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-07-09.txt** — The author reflects on their own flaws and limitations, seeking to cultivate peace and tranquility through meditation and philosophical insights.

These files prioritize the pursuit of simplicity, peace, and acceptance as a means to find meaning and contentment in life.

Source documents:
2008-09-22.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-09-22.txt 0.6942251115364034
2008-12-06.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-12-06.txt 0.69329826339918
2006-01-18.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2006-01-18.txt 0.6928289731862756
2024-07-09.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-07-09.txt 0.6851764077064736
2024-10-21.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-10-21.txt 0.6850957394630768
2021-05-15.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-05-15.txt 0.6792933726282656
2025-07-28.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-28.txt 0.677805020905822
2008-02-27.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-02-27.txt 0.6758790431127354
2025-02-05.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-02-05.txt 0.6751368515758781
2003-11-11.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt 0.6741012107117246
2015-02-12.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-02-12.txt 0.6724595748596391
2012-09-08.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2012-09-08.txt 0.6709374012666305
2007-05-30.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2007-05-30.txt 0.6705586893086012
2003-11-11.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt 0.6683483221331056
2011-04-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2011-04-20.txt 0.6679376031898634

# Generated by command-r7b
|
||||||
|
|
||||||
|
Enter a search topic or question (or 'exit'):
|
||||||
|
Simplicity, peace, and acceptance.
|
||||||
|
|
||||||
|
**Summary Theme:**
|
||||||
|
The dominant themes from the provided context revolve around personal growth, spirituality, and finding contentment through
|
||||||
|
simplicity, peace, and acceptance. The text emphasizes the idea that one should let go of intellectual pursuits and anxiety to feel
|
||||||
|
more content with life's challenges and find inner peace. It explores the concepts of Zen, tranquility, and the power of meditation
|
||||||
|
as a path to achieving this state of being. Additionally, the texts touch on the impact of personal experiences (like the birth of
|
||||||
|
a child) in fostering a sense of calm and connection with the divine.
|
||||||
|
|
||||||
|
**Matching Files:**
|
||||||
|
|
||||||
|
1. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-09-22.txt - D.T. Suzuki's excerpt highlights the importance of
|
||||||
|
letting go of intellectual pursuits and focusing on the present moment to achieve peace and contentment, aligning with themes of
|
||||||
|
simplicity and acceptance.
|
||||||
|
2. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-02-12.txt - This file discusses the emptiness and yearning for
|
||||||
|
deeper meaning, suggesting a journey towards personal peace and content in one's spiritual quest.
|
||||||
|
3. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-12-06.txt - The excerpt promotes happiness through contentment
|
||||||
|
and peacefulness, echoing the themes of simplicity and acceptance.
|
||||||
|
4. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt - The text explores the dissolution of the self and
|
||||||
|
finding peace in spiritual experiences, relating to themes of simplification and acceptance.
|
||||||
|
5. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-10-21.txt - A reflection on human nature and violence, this
|
||||||
|
passage emphasizes the importance of letting go of aggression and finding peace through a peaceful mind.
|
||||||
|
6. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-28.txt - This file mentions consensus-based processes and
|
||||||
|
Quaker values, which align with the themes of acceptance and tranquility in decision-making.
|
||||||
|
7. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2012-09-08.txt - The excerpt promotes the idea of regulating thoughts
|
||||||
|
and actions with a view towards mortality, highlighting simplicity and tranquility in preparation for life's end.
|
||||||
|
8. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2007-05-30.txt - The text discusses existential crises and finding
|
||||||
|
meaning through meditation and therapy, contributing to the theme of simplification through introspection.
|
||||||
|
9. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2011-04-20.txt - While more focused on child-rearing, this passage
|
||||||
|
hints at themes of acceptance and tranquility in navigating complex social interactions.
|
||||||
|
10. /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt - The exploration of prayer and spiritual experiences
|
||||||
|
leads to the theme of acceptance as one sheds the veneer of life, revealing deeper human existence.
|
||||||
|
|
||||||
|
|
||||||
|
Source documents:
|
||||||
|
|
||||||
|
2008-09-22.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-09-22.txt 0.6942251115364034
|
||||||
|
2008-12-06.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-12-06.txt 0.69329826339918
|
||||||
|
2006-01-18.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2006-01-18.txt 0.6928289731862756
|
||||||
|
2024-07-09.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-07-09.txt 0.6851764077064736
|
||||||
|
2024-10-21.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2024-10-21.txt 0.6850957394630768
|
||||||
|
2021-05-15.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2021-05-15.txt 0.6792933726282656
|
||||||
|
2025-07-28.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-07-28.txt 0.677805020905822
|
||||||
|
2008-02-27.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2008-02-27.txt 0.6758790431127354
|
||||||
|
2025-02-05.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2025-02-05.txt 0.6751368515758781
|
||||||
|
2003-11-11.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt 0.6741012107117246
|
||||||
|
2015-02-12.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2015-02-12.txt 0.6724595748596391
|
||||||
|
2012-09-08.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2012-09-08.txt 0.6709374012666305
|
||||||
|
2007-05-30.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2007-05-30.txt 0.6705586893086012
|
||||||
|
2003-11-11.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2003-11-11.txt 0.6683483221331056
|
||||||
|
2011-04-20.txt /Users/furst/Library/CloudStorage/Dropbox/nd/ssearch/data/2011-04-20.txt 0.6679376031898634
|
||||||
189
search_keywords.py
Normal file
@ -0,0 +1,189 @@
# search_keywords.py
# Keyword search: extract terms from a query using POS tagging, then grep
# across journal files for matches.
#
# Complements the vector search pipeline by catching exact names, places,
# and dates that embeddings can miss. No vector store or LLM needed.
#
# Term extraction uses NLTK POS tagging to keep nouns (NN*), proper nouns
# (NNP*), and adjectives (JJ*) -- skipping stopwords and function words
# automatically. Consecutive proper nouns are joined into multi-word phrases
# (e.g., "Robert Wright" stays as one search term, not "robert" + "wright").
#
# E.M.F. February 2026

import os
import sys
import re
from pathlib import Path

import nltk

#
# Globals
#
DATA_DIR = Path("./data")
CONTEXT_LINES = 2  # lines of context around each match
MAX_MATCHES_PER_FILE = 3  # cap matches shown per file to avoid flooding

# POS tags to keep: nouns, proper nouns, adjectives
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJS", "JJR"}

# Proper noun tags (consecutive runs are joined as phrases)
PROPER_NOUN_TAGS = {"NNP", "NNPS"}

# Minimum word length to keep (filters out short noise)
MIN_WORD_LEN = 3


def ensure_nltk_data():
    """Download NLTK data if not already present."""
    for resource, name in [
        ("tokenizers/punkt_tab", "punkt_tab"),
        ("taggers/averaged_perceptron_tagger_eng", "averaged_perceptron_tagger_eng"),
    ]:
        try:
            nltk.data.find(resource)
        except LookupError:
            print(f"Downloading NLTK resource: {name}")
            nltk.download(name, quiet=True)


def extract_terms(query):
    """Extract key terms from a query using POS tagging.

    Tokenizes the query, runs POS tagging, and keeps nouns, proper nouns,
    and adjectives. Consecutive proper nouns (NNP/NNPS) are joined into
    multi-word phrases (e.g., "Robert Wright" → "robert wright").

    Returns a list of terms (lowercase), phrases listed first.
    """
    tokens = nltk.word_tokenize(query)
    tagged = nltk.pos_tag(tokens)

    phrases = []       # multi-word proper noun phrases
    single_terms = []  # individual nouns/adjectives
    proper_run = []    # accumulator for consecutive proper nouns

    for word, tag in tagged:
        if tag in PROPER_NOUN_TAGS:
            proper_run.append(word)
        else:
            # Flush any accumulated proper noun run
            if proper_run:
                phrase = " ".join(proper_run).lower()
                if len(phrase) >= MIN_WORD_LEN:
                    phrases.append(phrase)
                proper_run = []
            # Keep other nouns and adjectives as single terms
            if tag in KEEP_TAGS and len(word) >= MIN_WORD_LEN:
                single_terms.append(word.lower())

    # Flush final proper noun run
    if proper_run:
        phrase = " ".join(proper_run).lower()
        if len(phrase) >= MIN_WORD_LEN:
            phrases.append(phrase)

    # Phrases first (more specific), then single terms
    all_terms = phrases + single_terms
    return list(dict.fromkeys(all_terms))  # deduplicate, preserve order


def search_files(terms, data_dir, context_lines=CONTEXT_LINES):
    """Search all .txt files in data_dir for the given terms.

    Returns a list of (file_path, match_count, matches) where matches is a
    list of (line_number, context_block) tuples.
    """
    if not terms:
        return []

    # Build a single regex pattern that matches any term (case-insensitive)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE
    )

    results = []
    txt_files = sorted(data_dir.glob("*.txt"))

    for fpath in txt_files:
        try:
            lines = fpath.read_text(encoding="utf-8").splitlines()
        except (OSError, UnicodeDecodeError):
            continue

        matches = []
        match_count = 0
        seen_lines = set()  # avoid overlapping context blocks

        for i, line in enumerate(lines):
            if pattern.search(line):
                match_count += 1
                if i in seen_lines:
                    continue

                # Extract context window
                start = max(0, i - context_lines)
                end = min(len(lines), i + context_lines + 1)
                block = []
                for j in range(start, end):
                    seen_lines.add(j)
                    marker = ">>>" if j == i else "   "
                    block.append(f"  {marker} {j+1:4d}: {lines[j]}")

                matches.append((i + 1, "\n".join(block)))

        if match_count > 0:
            results.append((fpath, match_count, matches))

    # Sort by match count (most matches first)
    results.sort(key=lambda x: x[1], reverse=True)
    return results


def main():
    if len(sys.argv) < 2:
        print("Usage: python search_keywords.py QUERY_TEXT")
        sys.exit(1)

    ensure_nltk_data()

    q = " ".join(sys.argv[1:])

    # Extract terms
    terms = extract_terms(q)
    if not terms:
        print(f"Query: {q}")
        print("No searchable terms extracted. Try a more specific query.")
        sys.exit(0)

    print(f"Query: {q}")
    print(f"Extracted terms: {', '.join(terms)}\n")

    # Search
    results = search_files(terms, DATA_DIR)

    if not results:
        print("No matches found.")
        sys.exit(0)

    # Summary
    total_matches = sum(r[1] for r in results)
    print(f"Found {total_matches} matches across {len(results)} files\n")

    # Detailed output
    for fpath, match_count, matches in results:
        print("=" * 60)
        print(f"--- {fpath.name} ({match_count} matches) ---")
        print("=" * 60)
        for line_num, block in matches[:MAX_MATCHES_PER_FILE]:
            print(block)
            print()
        if len(matches) > MAX_MATCHES_PER_FILE:
            print(f"  ... and {len(matches) - MAX_MATCHES_PER_FILE} more matches\n")


if __name__ == "__main__":
    main()
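The proper-noun run-joining that `extract_terms` performs can be illustrated without NLTK by applying the same accumulation logic to a pre-tagged token list. This is a minimal sketch: the `(word, tag)` pairs below are hardcoded stand-ins for `nltk.pos_tag` output, and `join_terms` is a hypothetical helper, not part of the committed script.

```python
# Sketch of the proper-noun run-joining in extract_terms(), with a
# hardcoded (word, tag) list standing in for nltk.pos_tag output.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJS", "JJR"}
MIN_WORD_LEN = 3

def join_terms(tagged):
    phrases, singles, run = [], [], []
    for word, tag in tagged:
        if tag in PROPER_NOUN_TAGS:
            run.append(word)          # extend the current proper-noun run
            continue
        if run:                        # flush a completed run as one phrase
            phrase = " ".join(run).lower()
            if len(phrase) >= MIN_WORD_LEN:
                phrases.append(phrase)
            run = []
        if tag in KEEP_TAGS and len(word) >= MIN_WORD_LEN:
            singles.append(word.lower())
    if run:                            # flush a trailing run
        phrase = " ".join(run).lower()
        if len(phrase) >= MIN_WORD_LEN:
            phrases.append(phrase)
    return list(dict.fromkeys(phrases + singles))  # dedupe, phrases first

tagged = [("passages", "NNS"), ("about", "IN"),
          ("Robert", "NNP"), ("Wright", "NNP"), ("and", "CC"),
          ("meditation", "NN")]
print(join_terms(tagged))  # ['robert wright', 'passages', 'meditation']
```

Note that "Robert Wright" survives as a single phrase and is ranked ahead of the single terms, which is what keeps the grep pattern from matching every unrelated "wright" in the archive.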
29
tests/README.md
Normal file
@ -0,0 +1,29 @@
# LLM Comparison Tests

Query used for all tests: **"Passages that quote Louis Menand."**
Script: `query_hybrid_bm25_v4.py` (hybrid BM25 + vector, cross-encoder re-rank to top 15)

Retrieval is identical across all tests (same 15 chunks, same scores).
Only the LLM synthesis step differs.

File naming: `results_<model>_t<temperature>.txt`

## Results

| File | LLM | Temperature | Files cited | Time | Notes |
|------|-----|-------------|-------------|------|-------|
| `results_gpt4omini_t0.1.txt` | gpt-4o-mini (OpenAI API) | 0.1 | 6 | 44s | Broader coverage, structured numbered list, drew from chunks ranked as low as #14 |
| `results_commandr7b_t0.8.txt` | command-r7b (Ollama local) | 0.8 (default) | 2 | 78s | Focused on top chunks, reproduced exact quotes verbatim |
| `results_gpt4omini_t0.3.txt` | gpt-4o-mini (OpenAI API) | 0.3 | 6 | 45s | Very similar to 0.1 run -- same 6 files, same structure, slightly more interpretive phrasing |
| `results_commandr7b_t0.3.txt` | command-r7b (Ollama local) | 0.3 | 6 | 94s | Major improvement over 0.8 default: cited 6 files (was 2), drew from lower-ranked chunks including 2024-08-03 (#15) |

## Observations

- Lowering command-r7b from 0.8 to 0.3 dramatically improved breadth (2 → 6 files cited).
  At 0.8, the model focused narrowly on the top-scored chunks. At 0.3, it used the full
  context window much more effectively.
- gpt-4o-mini showed little difference between 0.1 and 0.3. It already used the full
  context at 0.1. The API model appears less sensitive to temperature for this task.
- command-r7b at 0.3 took longer (94s vs 78s), likely due to generating more text.
- At temperature=0.3, both models converge on similar quality: 6 files cited, good
  coverage of the context window, mix of direct quotes and paraphrases.
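The retrieval counts printed in these transcripts (`Vector: 20, BM25: 20, overlap: 7, merged: 33, re-ranked to: 15`) follow from a dedup-then-rerank merge: union the two candidate lists by chunk id (20 + 20 - 7 = 33), then let a cross-encoder score order the pool and keep the top 15. A minimal sketch of that flow, assuming hypothetical names throughout (`merge_and_rerank`, toy chunk ids, and `len` standing in for a real cross-encoder scoring function; none of these come from `query_hybrid_bm25_v4.py`):

```python
def merge_and_rerank(vector_hits, bm25_hits, score_fn, top_n=15):
    """Union two ranked candidate lists by chunk id, then re-rank.

    vector_hits / bm25_hits: lists of (chunk_id, text) pairs.
    score_fn(text) -> relevance score; a stand-in for a cross-encoder
    scoring (query, chunk) pairs.
    """
    merged = {}
    for chunk_id, text in vector_hits + bm25_hits:
        merged.setdefault(chunk_id, text)  # chunks in both lists count once
    overlap = len(vector_hits) + len(bm25_hits) - len(merged)
    reranked = sorted(merged.items(),
                      key=lambda kv: score_fn(kv[1]), reverse=True)
    return overlap, reranked[:top_n]

# Toy data: 4 vector hits, 4 BM25 hits, 2 shared -> merged pool of 6
vec = [(f"v{i}", f"chunk {i}") for i in range(4)]
bm = [("v0", "chunk 0"), ("v1", "chunk 1"),
      ("b2", "chunk b2"), ("b3", "chunk b3")]
overlap, top = merge_and_rerank(vec, bm, score_fn=len, top_n=3)
print(overlap, len(top))  # 2 3
```

Because the merge dedupes by chunk id before re-ranking, the re-ranker sees each chunk once regardless of which retriever (or both) surfaced it, which is why the four transcripts all report the same 15 chunks with the same scores.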
54
tests/results_commandr7b_t0.3.txt
Normal file
@ -0,0 +1,54 @@
Query: --query Passages that quote Louis Menand.
Vector: 20, BM25: 20, overlap: 7, merged: 33, re-ranked to: 15

Response:

The provided excerpts contain several references to Louis Menand's ideas and quotes. Here are the specific passages:

- In file_path: ./data/2025-11-04.txt, you can find a direct quote from Louis Menand: "We created God, and then pretended that God created us. We hypostatized our own concept and turned it into something “out there” whose commandments (which we made up) we struggle to understand and obey. We are supplicants to our own fiction."
- In file_path: ./data/2025-02-14.txt, there is a reference to Menand's quote about the human tendency to create and interact with abstract ideas as if they are tangible realities.
- In file_path: ./data/2022-08-14.txt, another excerpt from Menand's work is mentioned: "We created God, and then we pretended that God created us."
- In file_path: ./data/2025-07-27.txt, the author discusses Menand's quote about the creation of fictions and the role of organization in human culture.
- In file_path: ./data/2024-09-06.txt, there is a mention of Menand's writing style, describing him as "witty and serious."
- In file_path: ./data/2025-02-14.txt, the poem takes a wistful tone, hinting at the fragility of human attachment to concepts, which is reminiscent of existential crises and Menand's ideas on hypostasis.
- In file_path: ./data/2025-07-27.txt, the author further elaborates on Menand's quote, emphasizing the human tendency to create and interact with fictions.
- In file_path: ./data/2024-08-03.txt, there is a reference to Louis Menand's book "The Free World" and its exploration of art, literature, and culture in the 20th century.

All these excerpts contribute to understanding Louis Menand's ideas on hypostasis, human creation of fictions, and the complex relationship between stories, beliefs, and reality.

Files contributing to the answer:
- ./data/2025-11-04.txt
- ./data/2025-02-14.txt
- ./data/2022-08-14.txt
- ./data/2025-07-27.txt
- ./data/2024-09-06.txt
- ./data/2025-02-14.txt
- ./data/2025-07-27.txt
- ./data/2024-08-03.txt

Source documents:
2025-11-04.txt ./data/2025-11-04.txt 3.303
2025-02-14.txt ./data/2025-02-14.txt 3.144
2022-08-14.txt ./data/2022-08-14.txt 2.183
2021-07-04.txt ./data/2021-07-04.txt 1.595
2025-02-14.txt ./data/2025-02-14.txt 1.504
2025-07-27.txt ./data/2025-07-27.txt 0.171
2024-07-26.txt ./data/2024-07-26.txt -0.969
2024-09-06.txt ./data/2024-09-06.txt -1.661
2024-04-08.txt ./data/2024-04-08.txt -2.017
2025-02-05.txt ./data/2025-02-05.txt -2.027
2022-04-17.txt ./data/2022-04-17.txt -2.054
2023-07-01.txt ./data/2023-07-01.txt -2.284
2023-01-06.txt ./data/2023-01-06.txt -2.315
2024-09-04.txt ./data/2024-09-04.txt -2.379
2024-08-03.txt ./data/2024-08-03.txt -2.391
Query processed in 94 seconds.
58
tests/results_commandr7b_t0.8.txt
Normal file
@ -0,0 +1,58 @@
Query: --query Passages that quote Louis Menand.
Vector: 20, BM25: 20, overlap: 7, merged: 33, re-ranked to: 15

Response:

The provided excerpts feature multiple references to Louis Menand's quote about hypostasis and our creation of God as a fiction. Here are the relevant passages:

- **"We created God, and then pretended that God created us."** **(file_path: ./data/2025-11-04.txt)**

This excerpt introduces Menand's quote as the starting point for an exploration of how humans create and interact with abstract concepts.

- **"We hypostatized our own concept and turned it into something “out there” whose commandments (which we made up) we struggle to understand and obey."** **(file_path: ./data/2025-11-04.txt)**

This quote delves further into the idea of turning concepts into something tangible, emphasizing the struggle to understand and follow the rules we create.

- **"We are supplicants to our own fiction."** **(file_path: ./data/2025-11-04.txt)**

The passage concludes by emphasizing the concept of being subservient to our own creations, highlighting the complex relationship between human stories and reality.

- **"We created God, and then pretended that God created us."** **(file_path: ./data/2022-08-14.txt)**

This excerpt reiterates the core idea from Menand's quote in a slightly different phrasing.

- **"We hypostatized our own concept and turned it into something “out there” whose commandments, which we also made up, we struggle to understand and obey."** **(file_path: ./data/2022-08-14.txt)**

This quote emphasizes the process of creating rules and struggling to comprehend them while adhering to them.

All these excerpts contribute to exploring Menand's perspective on how humans create, project, and interact with abstract concepts, highlighting the complex relationship between stories, beliefs, and reality.

**Files contributing:**

* **2025-11-04.txt**
* **2022-08-14.txt**

Source documents:
2025-11-04.txt ./data/2025-11-04.txt 3.303
2025-02-14.txt ./data/2025-02-14.txt 3.144
2022-08-14.txt ./data/2022-08-14.txt 2.183
2021-07-04.txt ./data/2021-07-04.txt 1.595
2025-02-14.txt ./data/2025-02-14.txt 1.504
2025-07-27.txt ./data/2025-07-27.txt 0.171
2024-07-26.txt ./data/2024-07-26.txt -0.969
2024-09-06.txt ./data/2024-09-06.txt -1.661
2024-04-08.txt ./data/2024-04-08.txt -2.017
2025-02-05.txt ./data/2025-02-05.txt -2.027
2022-04-17.txt ./data/2022-04-17.txt -2.054
2023-07-01.txt ./data/2023-07-01.txt -2.284
2023-01-06.txt ./data/2023-01-06.txt -2.315
2024-09-04.txt ./data/2024-09-04.txt -2.379
2024-08-03.txt ./data/2024-08-03.txt -2.391
Query processed in 78 seconds.
65
tests/results_gpt4omini_t0.1.txt
Normal file
@ -0,0 +1,65 @@
run_query.sh
Passages that quote Louis Menand.

Query: --query Passages that quote Louis Menand.
Vector: 20, BM25: 20, overlap: 7, merged: 33, re-ranked to: 15

Response:

The excerpts provided contain several passages that quote or reference Louis Menand, particularly focusing on his ideas about hypostasis and the human relationship with concepts such as God and societal structures. Here are the relevant passages:

1. **File: ./data/2025-11-04.txt**
   - This excerpt includes the quote: "We created God, and then pretended that God created us. We hypostatized our own concept and turned it into something 'out there' whose commandments (which we made up) we struggle to understand and obey. We are supplicants to our own fiction." It discusses the parallels between this idea and other societal constructs like democracy and politics.

2. **File: ./data/2025-02-14.txt**
   - The passage reiterates Menand's quote about hypostasis, emphasizing how humans have taken abstract concepts and given them a separate existence, which allows for worship and obedience. It reflects on the implications of this process on our understanding of reality.

3. **File: ./data/2022-08-14.txt**
   - This excerpt discusses Menand's critique of how people perceive forces as external and inevitable, which are actually human creations. It includes the same quote about God and hypostasis, highlighting the reification of human concepts.

4. **File: ./data/2025-07-27.txt**
   - Here, Menand's quote is referenced in the context of discussing narratives and organizational structures, emphasizing the fictional nature of societal constructs.

5. **File: ./data/2025-02-05.txt**
   - This excerpt paraphrases Menand's idea that we are "supplicants to our own fictions," linking it to the broader theme of storytelling in human culture.

6. **File: ./data/2024-09-04.txt**
   - Although not a direct quote, this excerpt mentions Menand's work in relation to AI and copyright, indicating his relevance in contemporary discussions.

In summary, the passages collectively illustrate Menand's exploration of how humans create and interact with abstract concepts, particularly the notion of God, and how these constructs influence our understanding of reality and societal structures.

**Files that contributed to the answer:**
- **2025-11-04.txt**: Discusses the quote in relation to societal constructs.
- **2025-02-14.txt**: Reiterates the quote and its implications on reality.
- **2022-08-14.txt**: Analyzes Menand's critique of perception and includes the quote.
- **2025-07-27.txt**: References the quote in the context of narratives and organization.
- **2025-02-05.txt**: Paraphrases Menand's idea about fictions in human culture.
- **2024-09-04.txt**: Mentions Menand's relevance in discussions about AI.

Source documents:
2025-11-04.txt ./data/2025-11-04.txt 3.303
2025-02-14.txt ./data/2025-02-14.txt 3.144
2022-08-14.txt ./data/2022-08-14.txt 2.183
2021-07-04.txt ./data/2021-07-04.txt 1.595
2025-02-14.txt ./data/2025-02-14.txt 1.504
2025-07-27.txt ./data/2025-07-27.txt 0.171
2024-07-26.txt ./data/2024-07-26.txt -0.969
2024-09-06.txt ./data/2024-09-06.txt -1.661
2024-04-08.txt ./data/2024-04-08.txt -2.017
2025-02-05.txt ./data/2025-02-05.txt -2.027
2022-04-17.txt ./data/2022-04-17.txt -2.054
2023-07-01.txt ./data/2023-07-01.txt -2.284
2023-01-06.txt ./data/2023-01-06.txt -2.315
2024-09-04.txt ./data/2024-09-04.txt -2.379
2024-08-03.txt ./data/2024-08-03.txt -2.391
Query processed in 44 seconds.
63
tests/results_gpt4omini_t0.3.txt
Normal file
@ -0,0 +1,63 @@
Query: --query Passages that quote Louis Menand.
Vector: 20, BM25: 20, overlap: 7, merged: 33, re-ranked to: 15

Response:

The excerpts contain several passages that quote or reference Louis Menand, particularly focusing on his ideas about hypostasis and the human relationship with concepts such as God and societal structures. Here are the relevant passages:

1. **File: ./data/2025-11-04.txt**
   - This excerpt includes the quote: "We created God, and then pretended that God created us. We hypostatized our own concept and turned it into something 'out there' whose commandments (which we made up) we struggle to understand and obey. We are supplicants to our own fiction." It discusses the parallels between this idea and other societal constructs like democracy and politics.

2. **File: ./data/2025-02-14.txt**
   - The passage reiterates Menand's quote about hypostasis and emphasizes how humans have created a sense of distance from their own creations, allowing for worship or rebellion against them. It reflects on the implications of this process for understanding reality and human experience.

3. **File: ./data/2022-08-14.txt**
   - This excerpt discusses Menand's "Free World," specifically pages 335-339, where he critiques how people perceive external forces as inevitable, which are actually human creations. It includes the same quote about God and hypostasis, highlighting the idea that these constructs are reified as "the way things are."

4. **File: ./data/2025-07-27.txt**
   - This excerpt references Menand's quote in the context of discussing anti-capitalist themes and the narratives that organize society, emphasizing the fictional nature of these constructs.

5. **File: ./data/2025-02-05.txt**
   - Here, Menand's notion that we are "supplicants to our own fictions" is paraphrased, relating it to the broader role of storytelling in human culture and how we create and live by the narratives we construct.

6. **File: ./data/2024-09-04.txt**
   - This excerpt mentions Menand in the context of discussing AI and copyright, though it does not quote him directly. It highlights his relevance in contemporary discussions about technology.

### Summary of Contributions:
- **2025-11-04.txt**: Introduces Menand's quote and connects it to societal constructs.
- **2025-02-14.txt**: Explores the implications of Menand's ideas on reality and human experience.
- **2022-08-14.txt**: Discusses Menand's critique of how humans perceive societal constructs as external forces.
- **2025-07-27.txt**: Relates Menand's quote to anti-capitalist themes and societal narratives.
- **2025-02-05.txt**: Paraphrases Menand's ideas in the context of storytelling in culture.
- **2024-09-04.txt**: Mentions Menand in discussions about AI, highlighting his ongoing relevance.

Overall, these excerpts collectively illustrate Menand's critical perspective on how humans create and relate to their own constructs, emphasizing the fictional nature of many societal narratives.

Source documents:
2025-11-04.txt ./data/2025-11-04.txt 3.303
2025-02-14.txt ./data/2025-02-14.txt 3.144
2022-08-14.txt ./data/2022-08-14.txt 2.183
2021-07-04.txt ./data/2021-07-04.txt 1.595
2025-02-14.txt ./data/2025-02-14.txt 1.504
2025-07-27.txt ./data/2025-07-27.txt 0.171
2024-07-26.txt ./data/2024-07-26.txt -0.969
2024-09-06.txt ./data/2024-09-06.txt -1.661
2024-04-08.txt ./data/2024-04-08.txt -2.017
2025-02-05.txt ./data/2025-02-05.txt -2.027
2022-04-17.txt ./data/2022-04-17.txt -2.054
2023-07-01.txt ./data/2023-07-01.txt -2.284
2023-01-06.txt ./data/2023-01-06.txt -2.315
2024-09-04.txt ./data/2024-09-04.txt -2.379
2024-08-03.txt ./data/2024-08-03.txt -2.391
Query processed in 45 seconds.
1065
vs_metrics.ipynb
Normal file
File diff suppressed because one or more lines are too long