Sync RAG and semantic-search updates from che-computing
- 03-rag, 04-semantic-search: env-var-before-imports fix in build/query scripts
- 03-rag: new libraries section, fetch_arxiv.py, exercises for larger corpus and finding current SOTA models, formal references (Lewis, Booth)
- 04-semantic-search: libraries pointer back to Part III, larger corpus subsection, model-update exercise, formal references
- 06-neural-networks: add Nielsen reference (recommended by student)
- README: vocab.md link, agentic systems in description, Ollama prereq for 02-05
- New: vocab.md (glossary organized by section)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:

parent b37661e983
commit 59e5f86884

9 changed files with 359 additions and 17 deletions
@@ -100,9 +100,46 @@ Save this as `cache_model.py` and run it:

```bash
python cache_model.py
```

(This script is also saved in the GitHub repository.) Each script that uses the model sets environment variables to prevent checking for updates. You can update manually either by running `cache_model.py` or by editing the scripts themselves.

## 2. The libraries we use

A RAG system is built from three independent layers, each handled by a different library:

| Layer | Library | What it does |
|-------|---------|--------------|
| **Orchestration** | [LlamaIndex](https://docs.llamaindex.ai/) | Glues the pieces together: chunking, indexing, retrieval, prompt assembly, response synthesis |
| **Embeddings** | [Hugging Face](https://huggingface.co/) (via `sentence-transformers`) | Provides the model that converts text into vectors |
| **Generation** | [Ollama](https://ollama.com/) | Runs the LLM that produces the final answer |

LlamaIndex used to be a single package; since version 0.10 it has been split into a small `llama-index-core` plus dedicated integration packages. That is why our `pip install` line includes several `llama-index-*` packages -- one for each external thing we plug in (Ollama for the LLM, Hugging Face for embeddings, file readers for local documents). If you find older tutorials online that import from `llama_index` (no `.core`), they predate the split and will not work.
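The install line this implies looks roughly like the following -- a sketch based on the integrations named above; treat `requirements.txt` in the repository as the authoritative list:

```shell
# One integration package per external dependency (illustrative list;
# see requirements.txt for the versions actually pinned):
pip install llama-index-core \
            llama-index-llms-ollama \
            llama-index-embeddings-huggingface \
            llama-index-readers-file
```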

The two key patterns to recognize in `build.py` and `query.py`:

**1. Global `Settings`.** Instead of passing the LLM and embedding model into every call, LlamaIndex uses a global `Settings` object:

```python
Settings.llm = Ollama(model="command-r7b", request_timeout=360.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
```

After these two lines, every component (index, query engine, retriever) automatically uses the configured models. This replaced the older `ServiceContext` pattern, which has been removed.

**2. Environment variables before imports.** At the top of each script:

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
```

These must come *before* the `from llama_index...` imports, because the Hugging Face libraries read the environment at import time. `HF_HUB_OFFLINE=1` tells the libraries not to check the Hub for updates on every run -- without it, you will see a "sending unauthenticated requests to the HF Hub" warning and the script may slow down or stall on a poor connection. `SENTENCE_TRANSFORMERS_HOME` controls where embedding models are cached.
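To see why the ordering matters, here is a tiny stand-in that reads the environment the way `huggingface_hub.constants` does -- once, when its module body runs. The module name and variable here are illustrative, not the library's real internals:

```python
import os
import types

# A fake module whose body reads HF_HUB_OFFLINE exactly once, at
# "import" time -- mimicking huggingface_hub.constants. Setting the
# variable *after* this body has run has no effect on it.
SOURCE = 'import os\nOFFLINE = os.environ.get("HF_HUB_OFFLINE", "0") == "1"\n'

def import_fake_constants():
    mod = types.ModuleType("fake_constants")
    exec(SOURCE, mod.__dict__)  # runs the module body, like an import would
    return mod

os.environ.pop("HF_HUB_OFFLINE", None)
print(import_fake_constants().OFFLINE)  # variable not set yet -> False

os.environ["HF_HUB_OFFLINE"] = "1"      # set *before* the "import"
print(import_fake_constants().OFFLINE)  # -> True
```

The same logic explains the rule in our scripts: `os.environ[...]` assignments must precede the `llama_index` imports that pull in the Hugging Face libraries.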

If you ever swap to a different embedding model and need a fresh download, temporarily remove `HF_HUB_OFFLINE` for one run or use a standalone script like `cache_model.py`.

## 2. The documents
## 3. The documents

The `data/` directory contains 10 emails from the University of Delaware president's office, spanning 2012–2025 (the same set from Part II). Each is a plain text file with a subject line, date, and body text.

@@ -124,7 +161,7 @@ python clean_eml.py

This extracts the subject, date, and body from each email and writes a dated `.txt` file to `./data`.

## 3. Building the vector store
## 4. Building the vector store

The script `build.py` does the heavy lifting:

@@ -154,7 +191,7 @@ We can't embed an entire document as a single vector — it would lose too much

> **Exercise 1:** Look at `build.py`. What would happen if you made the chunks much smaller (e.g., 100 tokens)? Much larger (e.g., 2000 tokens)? Think about the tradeoff between precision and context.
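To build intuition for that tradeoff, here is a toy word-window chunker. LlamaIndex's real splitter is token- and sentence-aware; this sketch only illustrates how chunk size and overlap interact:

```python
def chunk_words(text, chunk_size, overlap=0):
    """Naive chunker: fixed windows of `chunk_size` words, each sharing
    `overlap` words with the previous window."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

doc = "one two three four five six seven eight nine ten"
print(chunk_words(doc, chunk_size=4, overlap=1))  # windows share one word
print(len(chunk_words(doc, chunk_size=2)))        # 5 tiny, precise chunks
print(len(chunk_words(doc, chunk_size=100)))      # 1 chunk: the whole doc
```

Small chunks match queries precisely but carry little surrounding context; large chunks keep context but blur the match -- exactly the tension the exercise asks about.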

## 4. Querying the vector store
## 5. Querying the vector store

The script `query.py` loads the stored index, takes your question, and returns a response grounded in the documents:

@@ -207,7 +244,7 @@ Notice the **similarity scores** — these are cosine similarities between the q

> **Exercise 2:** Run the same query twice. Do you get exactly the same output? Why or why not?
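The similarity score itself is simple to compute. A minimal sketch with toy vectors (real embedding vectors have hundreds of dimensions):

```python
from math import sqrt

def cosine_similarity(a, b):
    """dot(a, b) / (|a| |b|): 1.0 for parallel vectors, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 3.0, 0.0]))  # 0.0 (orthogonal)
```

Note that only direction matters, not magnitude: doubling a vector leaves its cosine similarity unchanged.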

## 5. Understanding the pieces
## 6. Understanding the pieces

### The embedding model

@@ -236,7 +273,7 @@ Our custom prompt in `query.py` is more detailed — it asks for structured outp

> **Exercise 3:** Modify the prompt in `query.py`. For example, ask the model to respond in the style of a news reporter, or to focus only on dates and events. How does the output change?

## 6. Exercises
## 7. Exercises

> **Exercise 4:** Try different embedding models. Replace `BAAI/bge-large-en-v1.5` with `sentence-transformers/all-mpnet-base-v2` in both `build.py` and `query.py`. Rebuild the vector store and compare the results.

@@ -246,6 +283,30 @@ Our custom prompt in `query.py` is more detailed — it asks for structured outp

> **Exercise 7:** Bring your own documents. Find a collection of text files — research paper abstracts, class notes, or a downloaded text from Project Gutenberg — and build a RAG system over them. What questions can you answer that a plain LLM cannot?

> **Exercise 8 (optional, sets up Part IV):** Build a larger corpus. Ten emails is small enough that retrieval is barely selective — the system returns most of the corpus on every query. The script `fetch_arxiv.py` pulls 100 recent abstracts from a chosen arXiv category and writes one text file per abstract:
>
> ```bash
> python fetch_arxiv.py --category cs.LG --max 100 --output data_arxiv
> ```
>
> Try other categories: `physics.chem-ph` (chemical physics), `cond-mat.soft` (soft matter), `cs.AI` (artificial intelligence), `cs.CL` (computational linguistics), `physics.flu-dyn` (fluid dynamics). Then update `build.py` to point at your new directory (or symlink it as `./data`), rebuild the vector store, and query it. With a 100-document corpus, retrieval becomes meaningfully selective and the choice of embedding model matters more.
>
> Other corpora to consider:
> - **CCPS process safety case studies** — https://www.aiche.org/ccps/resources (some are openly available as text or PDF)
> - **US Chemical Safety Board incident reports** — https://www.csb.gov/investigations/
> - **NIST chemistry data sheets** — https://webbook.nist.gov/
> - **AIChE journal abstracts** — many publishers expose abstracts via their APIs
>
> If your sources are PDFs, install `llama-index-readers-file` (already in `requirements.txt`) and use `SimpleDirectoryReader` — it picks up `.pdf` automatically.
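One way to do the symlink step from Exercise 8 without editing `build.py` (a sketch, assuming the default `./data` layout):

```shell
mkdir -p data_arxiv                           # corpus written by fetch_arxiv.py
if [ -e data ]; then mv data data_emails; fi  # keep the original emails, if present
ln -s data_arxiv data                         # build.py still reads ./data
```

Swapping the symlink back restores the email corpus without rebuilding anything by hand.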

> **Exercise 9 (optional):** The embedding model `BAAI/bge-large-en-v1.5` and the LLM `command-r7b` were both released in 2024. By the time you read this, newer and likely better models exist. Find a current state-of-the-art:
>
> - Browse the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for current top embedding models
> - Browse [Ollama's model library](https://ollama.com/library) sorted by recent or popular for current LLMs
> - Replace one model at a time in `build.py` and `query.py`, rebuild if the embedding model changes, and compare retrieval quality
>
> Document the model versions and dates in your machine log. Models that "feel old" are part of the engineering reality of working with this stack — what was best last year may not be best today.

## Additional resources and references

@@ -270,5 +331,6 @@ Other embedding model mentioned: `sentence-transformers/all-mpnet-base-v2`

### Further reading

- NIST IR 8579, [*Developing the NCCoE Chatbot: Technical and Security Learnings from the Initial Implementation*](https://csrc.nist.gov/pubs/ir/8579/ipd) ([PDF](https://nvlpubs.nist.gov/nistpubs/ir/2025/NIST.IR.8579.ipd.pdf)) — practical guidance on building a RAG-based chatbot, including architecture and security considerations
- Open WebUI (https://openwebui.com) — a turnkey local RAG interface if you want a GUI
- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In *Advances in Neural Information Processing Systems*. Curran Associates, Inc., 9459–9474. https://proceedings.neurips.cc/paper_files/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html — the foundational RAG paper that introduced the retrieve-and-generate framework we use here.
- Harold Booth. 2025. *Development and Implementation of the NCCoE Chatbot: A Comprehensive Report*. National Institute of Standards and Technology, Gaithersburg, MD. https://doi.org/10.6028/NIST.IR.8579.ipd — practical guidance on building a RAG-based chatbot, including architecture and security considerations.
- Open WebUI (https://openwebui.com) — a turnkey local RAG interface if you want a GUI.

@@ -6,6 +6,14 @@
# August 2025
# E. M. Furst

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,

112 03-rag/fetch_arxiv.py Normal file

@@ -0,0 +1,112 @@
# fetch_arxiv.py
#
# Fetch arXiv abstracts and write each as a separate text file in ./data.
# This builds a larger, more interesting corpus for RAG experiments.
#
# Default: 100 most recent abstracts in cs.LG (machine learning).
# Try other categories: physics.chem-ph, cond-mat.soft, cs.AI, cs.CL,
# physics.flu-dyn, etc.
#
# Usage:
#   python fetch_arxiv.py                                    # default: cs.LG, 100 papers
#   python fetch_arxiv.py --category cs.AI
#   python fetch_arxiv.py --category cs.LG --max 200 --output data_arxiv
#
# CHEG 667-013

import argparse
import os
import re
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def fetch_abstracts(category, max_results, batch_size=50):
    """Fetch arXiv abstracts in batches via the API."""
    base_url = "https://export.arxiv.org/api/query"
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    entries = []

    for start in range(0, max_results, batch_size):
        n = min(batch_size, max_results - start)
        params = {
            "search_query": f"cat:{category}",
            "sortBy": "submittedDate",
            "sortOrder": "descending",
            "start": start,
            "max_results": n,
        }
        url = f"{base_url}?{urllib.parse.urlencode(params)}"
        print(f" fetching {start+1}-{start+n}...")
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
        root = ET.fromstring(data)
        batch = root.findall("atom:entry", ns)
        if not batch:
            break
        entries.extend(batch)
        time.sleep(3)  # arXiv asks for 3-second delay between requests
    return entries


def safe_filename(s, max_len=80):
    """Convert a title to a filesystem-safe filename."""
    s = re.sub(r"\s+", "_", s.strip())
    s = re.sub(r"[^A-Za-z0-9._-]", "", s)
    return s[:max_len]


def write_abstract(entry, outdir):
    """Extract title, authors, date, abstract; write to a text file."""
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    title = entry.find("atom:title", ns).text.strip()
    summary = entry.find("atom:summary", ns).text.strip()
    published = entry.find("atom:published", ns).text.strip()[:10]
    authors = [
        a.find("atom:name", ns).text
        for a in entry.findall("atom:author", ns)
    ]
    arxiv_id = entry.find("atom:id", ns).text.strip().split("/")[-1]

    fname = f"{published}_{safe_filename(title)}.txt"
    path = os.path.join(outdir, fname)
    body = (
        f"Title: {title}\n"
        f"Authors: {', '.join(authors)}\n"
        f"Date: {published}\n"
        f"arXiv: {arxiv_id}\n"
        f"\n"
        f"{summary}\n"
    )
    with open(path, "w") as f:
        f.write(body)
    return fname


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--category", default="cs.LG",
                        help="arXiv category (default: cs.LG)")
    parser.add_argument("--max", type=int, default=100,
                        help="number of abstracts to fetch (default: 100)")
    parser.add_argument("--output", default="data",
                        help="output directory (default: data)")
    args = parser.parse_args()

    os.makedirs(args.output, exist_ok=True)
    print(f"Fetching {args.max} abstracts from arXiv:{args.category} -> {args.output}/")
    entries = fetch_abstracts(args.category, args.max)
    print(f"Got {len(entries)} entries. Writing to {args.output}/...")
    for e in entries:
        try:
            fname = write_abstract(e, args.output)
        except Exception as exc:
            print(f" skipped one: {exc}")
            continue
    print(f"Done. {len(os.listdir(args.output))} files in {args.output}/")


if __name__ == "__main__":
    main()
@@ -5,6 +5,14 @@
# August 2025
# E. M. Furst

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import (
    load_index_from_storage,
    StorageContext,
@@ -13,12 +21,7 @@ from llama_index.core import (
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
import os, time

#
# Globals
#
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import time

# Embedding model used in vector store (this should match the one in build.py)
embed_model = HuggingFaceEmbedding(cache_folder="./models",