Sync RAG and semantic-search updates from che-computing
- 03-rag, 04-semantic-search: env-var-before-imports fix in build/query scripts
- 03-rag: new libraries section, fetch_arxiv.py, exercises for larger corpus and finding current SOTA models, formal references (Lewis, Booth)
- 04-semantic-search: libraries pointer back to Part III, larger corpus subsection, model-update exercise, formal references
- 06-neural-networks: add Nielsen reference (recommended by student)
- README: vocab.md link, agentic systems in description, Ollama prereq for 02-05
- New: vocab.md (glossary organized by section)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent b37661e983
commit 59e5f86884

9 changed files with 359 additions and 17 deletions
@ -103,9 +103,45 @@ Make sure `ollama` is running and `command-r7b` is available:
```bash
ollama pull command-r7b
```
### Libraries and environment variables

This section uses the same three-layer architecture introduced in [Part III](../03-rag/README.md#2-the-libraries-we-use): LlamaIndex for orchestration, Hugging Face for the embedding and cross-encoder models, and Ollama for response generation. One new piece is added: `llama-index-retrievers-bm25` for keyword-based retrieval. BM25 is a classical, non-neural algorithm that complements the neural embedding model.

The same `Settings`-based configuration applies, and the same environment-variable pattern is used at the top of every script:

```python
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"
```

These come *before* any LlamaIndex or Hugging Face imports because the libraries read the environment at import time. `HF_HUB_OFFLINE=1` prevents the "sending unauthenticated requests to the HF Hub" warning and makes runs deterministic. Remove it temporarily if you want to download a fresh model. See [Part III, section 2](../03-rag/README.md#2-the-libraries-we-use) for the full explanation.
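BM25's behavior is easy to see in a few lines. The sketch below implements the classical Okapi BM25 scoring formula directly; it is an illustration only: the course scripts use `llama-index-retrievers-bm25` rather than this hand-rolled version, and the toy corpus and parameter values are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each whitespace-tokenized doc against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency: how many docs contain each term at least once
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Lucene-style IDF, always non-negative
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation (k1) and length normalization (b)
            score += idf * (tf[term] * (k1 + 1)) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "bm25 is a ranking function for keyword retrieval",
    "embeddings map text to dense vectors",
    "hybrid retrieval combines bm25 with embeddings",
]
# Doc 1 mentions neither query term, so it scores exactly 0
scores = bm25_scores("bm25 retrieval", docs)
```

Note what this makes obvious: BM25 only rewards exact term matches, which is precisely why it complements an embedding model that matches by meaning.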
## 3. Building the vector store

### A larger corpus

The hybrid retrieval and re-ranking pipeline only earns its keep on a corpus large enough that retrieval is genuinely selective. The 10 emails from Part III are too few -- vector and BM25 retrieval will return almost the same chunks every time. For this section, we recommend a 100-abstract arXiv corpus.

If you did Exercise 8 in Part III, you already have one. Otherwise, run:

```bash
python ../03-rag/fetch_arxiv.py --category cs.LG --max 100 --output data
```

This populates `./data` with 100 recent papers from `cs.LG` (machine learning). Other relevant categories:

- `physics.chem-ph` -- chemical physics
- `cond-mat.soft` -- soft matter
- `physics.flu-dyn` -- fluid dynamics
- `cs.AI` -- artificial intelligence

You can also drop in your own collection: NIST data sheets, [CCPS process safety case studies](https://www.aiche.org/ccps/resources), [US Chemical Safety Board incident reports](https://www.csb.gov/investigations/), or any other text-format documents. PDFs work too -- `SimpleDirectoryReader` reads them automatically when `llama-index-readers-file` is installed.
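Whatever the source, the ingestion contract is simple: `SimpleDirectoryReader` takes a directory of parseable files. A minimal sketch of laying out (title, abstract) records as one plain-text file per paper follows; the file-naming scheme is our own convention for this example, not something `fetch_arxiv.py` is guaranteed to use.

```python
import tempfile
from pathlib import Path

def write_corpus(records, out_dir):
    """Write (title, abstract) records as one .txt file per paper,
    the directory layout SimpleDirectoryReader can ingest."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, (title, abstract) in enumerate(records):
        # Title and abstract separated by a blank line, one file per record
        (out / f"paper_{i:03d}.txt").write_text(f"{title}\n\n{abstract}\n")
    return sorted(out.glob("*.txt"))

out_dir = tempfile.mkdtemp()
files = write_corpus(
    [("Attention Is All You Need", "We propose the Transformer architecture...")],
    out_dir,
)
```

Pointing `SimpleDirectoryReader` at `out_dir` would then load each file as a document, exactly as with the fetched arXiv abstracts.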
### Building the index

The `build_store.py` script works like the one in Part III, with a few differences:

- **Smaller chunks**: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap
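Ignoring sentence boundaries for a moment, the chunk-size/overlap arithmetic works as in this simplified sketch; the real `SentenceSplitter` additionally respects sentence boundaries, so actual chunk lengths vary.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into overlapping fixed-size windows
    (simplified model of chunk_size/chunk_overlap; no sentence awareness)."""
    step = chunk_size - overlap  # each window starts 231 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
# The last 25 tokens of each chunk reappear as the first 25 of the next,
# so no sentence straddling a boundary is lost entirely.
```

The overlap is why a fact that lands near a chunk edge can still be retrieved intact: it appears whole in at least one of the two neighboring chunks.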
@ -249,6 +285,8 @@ This is a fallback when you know exactly what you're looking for and don't need
> **Exercise 8:** Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?
> **Exercise 9:** The cross-encoder we use (`cross-encoder/ms-marco-MiniLM-L-12-v2`) and the embedding model (`BAAI/bge-large-en-v1.5`) date from 2022-2024. Newer models are likely available. Browse the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for current top embedding models and re-rankers. Swap one in (you will need a full rebuild if you change the embedding model). Does retrieval quality improve? At what cost in model size and speed? This is a recurring task in production systems — models age, and the right answer in 2024 is not the right answer today.
## Additional resources and references
@ -272,5 +310,6 @@ This is a fallback when you know exactly what you're looking for and don't need
### Further reading
- Stephen Robertson and Hugo Zaragoza. 2009. *The Probabilistic Relevance Framework: BM25 and Beyond*. Foundations and Trends in Information Retrieval 3, 4 (April 2009), 333-389. https://doi.org/10.1561/1500000019 — the theoretical basis for BM25.
- Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085 — cross-encoder re-ranking applied to information retrieval; the approach we use in `query_hybrid.py`.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*, 3982-3992. https://arxiv.org/abs/1908.10084 — the foundational paper for `sentence-transformers`, the library behind both our embedding and cross-encoder models.
@ -9,6 +9,14 @@
# E. M. Furst
# Used Sonnet 4.5 to suggest changes; Opus 4.6 for incremental update

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
@ -21,7 +29,6 @@ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
import argparse
import datetime
import os
import time

# Shared constants