Sync RAG and semantic-search updates from che-computing

- 03-rag, 04-semantic-search: env-var-before-imports fix in build/query scripts
- 03-rag: new libraries section, fetch_arxiv.py, exercises for larger corpus
  and finding current SOTA models, formal references (Lewis, Booth)
- 04-semantic-search: libraries pointer back to Part III, larger corpus
  subsection, model-update exercise, formal references
- 06-neural-networks: add Nielsen reference (recommended by student)
- README: vocab.md link, agentic systems in description, Ollama prereq for 02-05
- New: vocab.md (glossary organized by section)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eric Furst 2026-04-28 12:05:08 -04:00
commit 59e5f86884
9 changed files with 359 additions and 17 deletions


@ -103,9 +103,45 @@ Make sure `ollama` is running and `command-r7b` is available:
ollama pull command-r7b
```
### Libraries and environment variables
This section uses the same three-layer architecture introduced in [Part III](../03-rag/README.md#2-the-libraries-we-use): LlamaIndex for orchestration, Hugging Face for the embedding and cross-encoder models, and Ollama for response generation. One new piece is added: `llama-index-retrievers-bm25` for keyword-based retrieval. BM25 is a classical, non-neural algorithm that complements the neural embedding model.
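As a minimal sketch of how the two retrievers can be combined (assuming an `index` built as in section 3; the top-k values and the query string here are illustrative, and `query_hybrid.py` may wire this up differently):

```python
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Neural retriever: similarity search over the stored embeddings.
vector_retriever = index.as_retriever(similarity_top_k=10)

# Classical retriever: BM25 keyword scoring over the same nodes.
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=10
)

# Merge the two ranked lists with reciprocal rank fusion.
hybrid = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=10,
    num_queries=1,  # no LLM query rewriting; just fuse the two result lists
    mode="reciprocal_rerank",
)
nodes = hybrid.retrieve("shear-thinning behavior of colloidal gels")
```

The point is that BM25 and the embedding model each produce their own ranked list, and the fusion step merges them before any re-ranking happens.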
The same `Settings`-based configuration applies, and the same environment-variable pattern is used at the top of every script:
```python
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"         # silence tokenizer fork warnings
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"  # cache models inside the project
os.environ["HF_HUB_OFFLINE"] = "1"                     # offline mode: cached models only
```
These come *before* any LlamaIndex or Hugging Face imports because the libraries read the environment at import time. `HF_HUB_OFFLINE=1` prevents the "sending unauthenticated requests to the HF Hub" warning and makes runs deterministic. Remove it temporarily if you want to download a fresh model. See [Part III, section 2](../03-rag/README.md#2-the-libraries-we-use) for the full explanation.
## 3. Building the vector store
### A larger corpus
The hybrid retrieval and re-ranking pipeline only earns its keep on a corpus large enough that retrieval is genuinely selective. The 10 emails from Part III are too few -- vector and BM25 will return almost the same chunks every time. For this section, we recommend a 100-abstract arXiv corpus.
If you did Exercise 8 in Part III, you already have one. Otherwise, run:
```bash
python ../03-rag/fetch_arxiv.py --category cs.LG --max 100 --output data
```
This populates `./data` with abstracts of 100 recent papers from `cs.LG` (machine learning). Other relevant categories:
- `physics.chem-ph` -- chemical physics
- `cond-mat.soft` -- soft matter
- `physics.flu-dyn` -- fluid dynamics
- `cs.AI` -- artificial intelligence
You can also drop in your own collection: NIST data sheets, CCPS process safety case studies (https://www.aiche.org/ccps/resources), US Chemical Safety Board incident reports (https://www.csb.gov/investigations/), or any other text-format documents. PDFs work too -- `SimpleDirectoryReader` reads them automatically when `llama-index-readers-file` is installed.
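Ingestion is the same regardless of source; a short sketch of the loading step (directory name as above):

```python
from llama_index.core import SimpleDirectoryReader

# Reads every supported file in ./data -- plain text and markdown natively,
# PDFs via llama-index-readers-file.
documents = SimpleDirectoryReader("data").load_data()
print(f"Loaded {len(documents)} documents")
```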
### Building the index
The `build_store.py` script works like the one in Part III, with a few differences:
- **Smaller chunks**: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap (sketched below)
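A sketch of the chunking configuration this implies, using LlamaIndex's `SentenceSplitter` (the actual `build_store.py` may set this up differently):

```python
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# 256-token chunks with 25 tokens of overlap, per the settings above.
Settings.node_parser = SentenceSplitter(chunk_size=256, chunk_overlap=25)
```

Smaller chunks make individual retrieval hits more precise, which matters more here because the re-ranker scores each chunk against the query independently.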
@ -249,6 +285,8 @@ This is a fallback when you know exactly what you're looking for and don't need
> **Exercise 8:** Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?
> **Exercise 9:** The cross-encoder we use (`cross-encoder/ms-marco-MiniLM-L-12-v2`) and the embedding model (`BAAI/bge-large-en-v1.5`) date from 2022-2024. Newer models are likely available. Browse the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for current top embedding models and re-rankers. Swap one in (you will need a full rebuild if you change the embedding model). Does retrieval quality improve? At what cost in model size and speed? This is a recurring task in production systems — models age, and the right answer in 2024 is not the right answer today.
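For the swap itself, each model name lives in one place. A hedged sketch (the names in capitals are placeholders, not recommendations -- pick real ones from the leaderboard):

```python
from llama_index.core import Settings
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Changing the embedding model invalidates every stored vector:
# rebuild the index afterwards.
Settings.embed_model = HuggingFaceEmbedding(model_name="NEW-EMBEDDING-MODEL")

# The cross-encoder scores (query, chunk) pairs at query time,
# so it can be swapped without touching the index.
reranker = SentenceTransformerRerank(model="NEW-CROSS-ENCODER", top_n=5)
```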
## Additional resources and references
@ -272,5 +310,6 @@ This is a fallback when you know exactly what you're looking for and don't need
### Further reading
- Stephen Robertson and Hugo Zaragoza. 2009. *The Probabilistic Relevance Framework: BM25 and Beyond*. Foundations and Trends in Information Retrieval 3, 4 (April 2009), 333-389. https://doi.org/10.1561/1500000019 — the theoretical basis for BM25.
- Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085 — cross-encoder re-ranking applied to information retrieval; the approach we use in `query_hybrid.py`.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*, 3982-3992. https://arxiv.org/abs/1908.10084 — the foundational paper for `sentence-transformers`, the library behind both our embedding and cross-encoder models.