Sync RAG and semantic-search updates from che-computing

- 03-rag, 04-semantic-search: env-var-before-imports fix in build/query scripts
- 03-rag: new libraries section, fetch_arxiv.py, exercises for larger corpus
  and finding current SOTA models, formal references (Lewis, Booth)
- 04-semantic-search: libraries pointer back to Part III, larger corpus
  subsection, model-update exercise, formal references
- 06-neural-networks: add Nielsen reference (recommended by student)
- README: vocab.md link, agentic systems in description, Ollama prereq for 02-05
- New: vocab.md (glossary organized by section)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eric Furst 2026-04-28 12:05:08 -04:00
commit 59e5f86884
9 changed files with 359 additions and 17 deletions


@ -103,9 +103,45 @@ Make sure `ollama` is running and `command-r7b` is available:
ollama pull command-r7b
```
### Libraries and environment variables
This section uses the same three-layer architecture introduced in [Part III](../03-rag/README.md#2-the-libraries-we-use): LlamaIndex for orchestration, Hugging Face for the embedding and cross-encoder models, and Ollama for response generation. One new piece is added: `llama-index-retrievers-bm25` for keyword-based retrieval. BM25 is a classical, non-neural algorithm that complements the neural embedding model.
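As a minimal sketch of how the two retrievers can be combined (assuming an `index` built as in section 3; the top-k values and the query string here are illustrative, and `query_hybrid.py` may wire this up differently):

```python
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Neural retriever: similarity search over the stored embeddings.
vector_retriever = index.as_retriever(similarity_top_k=10)

# Classical retriever: BM25 keyword scoring over the same nodes.
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=10
)

# Merge the two ranked lists with reciprocal rank fusion.
hybrid = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=10,
    num_queries=1,  # no LLM query rewriting; just fuse the two result lists
    mode="reciprocal_rerank",
)
nodes = hybrid.retrieve("shear-thinning behavior of colloidal gels")
```

The point is that BM25 and the embedding model each produce their own ranked list, and the fusion step merges them before any re-ranking happens.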
The same `Settings`-based configuration applies, and the same environment-variable pattern is used at the top of every script:
```python
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"         # silence tokenizer fork warnings
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"  # cache models inside the project
os.environ["HF_HUB_OFFLINE"] = "1"                     # offline mode: cached models only
```
These come *before* any LlamaIndex or Hugging Face imports because the libraries read the environment at import time. `HF_HUB_OFFLINE=1` prevents the "sending unauthenticated requests to the HF Hub" warning and makes runs deterministic. Remove it temporarily if you want to download a fresh model. See [Part III, section 2](../03-rag/README.md#2-the-libraries-we-use) for the full explanation.
## 3. Building the vector store
### A larger corpus
The hybrid retrieval and re-ranking pipeline only earns its keep on a corpus large enough that retrieval is genuinely selective. The 10 emails from Part III are too few -- vector and BM25 will return almost the same chunks every time. For this section, we recommend a 100-abstract arXiv corpus.
If you did Exercise 8 in Part III, you already have one. Otherwise, run:
```bash
python ../03-rag/fetch_arxiv.py --category cs.LG --max 100 --output data
```
This populates `./data` with abstracts of 100 recent papers from `cs.LG` (machine learning). Other relevant categories:
- `physics.chem-ph` -- chemical physics
- `cond-mat.soft` -- soft matter
- `physics.flu-dyn` -- fluid dynamics
- `cs.AI` -- artificial intelligence
You can also drop in your own collection: NIST data sheets, CCPS process safety case studies (https://www.aiche.org/ccps/resources), US Chemical Safety Board incident reports (https://www.csb.gov/investigations/), or any other text-format documents. PDFs work too -- `SimpleDirectoryReader` reads them automatically when `llama-index-readers-file` is installed.
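Ingestion is the same regardless of source; a short sketch of the loading step (directory name as above):

```python
from llama_index.core import SimpleDirectoryReader

# Reads every supported file in ./data -- plain text and markdown natively,
# PDFs via llama-index-readers-file.
documents = SimpleDirectoryReader("data").load_data()
print(f"Loaded {len(documents)} documents")
```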
### Building the index
The `build_store.py` script works like the one in Part III, with a few differences:
- **Smaller chunks**: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap (sketched below)
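A sketch of the chunking configuration this implies, using LlamaIndex's `SentenceSplitter` (the actual `build_store.py` may set this up differently):

```python
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# 256-token chunks with 25 tokens of overlap, per the settings above.
Settings.node_parser = SentenceSplitter(chunk_size=256, chunk_overlap=25)
```

Smaller chunks make individual retrieval hits more precise, which matters more here because the re-ranker scores each chunk against the query independently.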
@ -249,6 +285,8 @@ This is a fallback when you know exactly what you're looking for and don't need
> **Exercise 8:** Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?
> **Exercise 9:** The cross-encoder we use (`cross-encoder/ms-marco-MiniLM-L-12-v2`) and the embedding model (`BAAI/bge-large-en-v1.5`) date from 2022-2024. Newer models are likely available. Browse the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for current top embedding models and re-rankers. Swap one in (you will need a full rebuild if you change the embedding model). Does retrieval quality improve? At what cost in model size and speed? This is a recurring task in production systems — models age, and the right answer in 2024 is not the right answer today.
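For the swap itself, each model name lives in one place. A hedged sketch (the names in capitals are placeholders, not recommendations -- pick real ones from the leaderboard):

```python
from llama_index.core import Settings
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Changing the embedding model invalidates every stored vector:
# rebuild the index afterwards.
Settings.embed_model = HuggingFaceEmbedding(model_name="NEW-EMBEDDING-MODEL")

# The cross-encoder scores (query, chunk) pairs at query time,
# so it can be swapped without touching the index.
reranker = SentenceTransformerRerank(model="NEW-CROSS-ENCODER", top_n=5)
```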
## Additional resources and references
@ -272,5 +310,6 @@ This is a fallback when you know exactly what you're looking for and don't need
### Further reading
- Stephen Robertson and Hugo Zaragoza. 2009. *The Probabilistic Relevance Framework: BM25 and Beyond*. Foundations and Trends in Information Retrieval 3, 4 (April 2009), 333-389. https://doi.org/10.1561/1500000019 — the theoretical basis for BM25.
- Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085 — cross-encoder re-ranking applied to information retrieval; the approach we use in `query_hybrid.py`.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*, 3982-3992. https://arxiv.org/abs/1908.10084 — the foundational paper for `sentence-transformers`, the library behind both our embedding and cross-encoder models.