
Large Language Models Part IV: Advanced Retrieval and Semantic Search

CHEG 667-013 — Chemical Engineering with Computers
Department of Chemical and Biomolecular Engineering, University of Delaware


Key idea

Build a more effective search system by combining multiple retrieval strategies and re-ranking results.

Key goals

  • Understand why simple vector search sometimes misses relevant results
  • Combine vector similarity with keyword matching (hybrid retrieval)
  • Use a cross-encoder to re-rank candidates
  • Compare LLM-synthesized answers with raw chunk retrieval

This is an advanced topic that builds on Part III (RAG). Make sure you are comfortable with building a vector store and querying it before proceeding.

In Part III, we built a RAG system that embedded documents, retrieved the most similar chunks, and passed them to an LLM. That pipeline works well for many queries — but it has blind spots.

Consider searching for a specific person's name, a date, or a technical term. Vector embeddings capture meaning, not exact strings. A query for "Dr. Rodriguez" might retrieve chunks about "faculty" or "professors" instead of chunks that literally contain the name. Similarly, a query about "October 2020" might return chunks about autumn events in general.

This tutorial introduces three improvements:

  1. Hybrid retrieval — combine vector similarity (good at meaning) with BM25 keyword matching (good at exact terms)
  2. Cross-encoder re-ranking — use a second model to score each (query, chunk) pair more carefully
  3. Raw retrieval mode — inspect what the pipeline retrieves before the LLM sees it

The result is a more effective search system that catches both semantic matches and exact-term matches.

1. How hybrid retrieval works

In Part III, our pipeline was:

Query → Embed → Vector similarity (top 15) → LLM → Response

The improved pipeline is:

Query → Embed ──→ Vector similarity (top 20) ──┐
                                                ├─→ Merge & deduplicate → Cross-encoder re-rank (top 15) → LLM → Response
Query → Tokenize → BM25 term matching (top 20) ┘

Vector retrieval (dense)

This is what we used in Part III. The query is embedded into a vector, and the most similar chunk vectors are returned. This catches semantic matches — chunks with similar meaning, even if the words are different.

BM25 retrieval (sparse)

BM25 is a classical information retrieval algorithm based on term frequency. It scores documents by how often the query's words appear, adjusted for document length. It's fast, requires no embeddings, and excels at finding exact names, dates, and technical terms that embeddings might miss.
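
To make the scoring concrete, here is a minimal pure-Python sketch of the BM25 formula (the actual pipeline uses the llama-index-retrievers-bm25 package; the function name and toy documents below are illustrative only):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula.

    query: list of query tokens. docs: list of token lists.
    Returns one score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            n_t = sum(1 for d in docs if term in d)            # docs containing term
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # rare terms weigh more
            tf = doc.count(term)                                # term frequency in doc
            # Length normalization: long documents are penalized via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "dr rodriguez spoke at the opening ceremony".split(),
    "the faculty senate met to discuss campus safety".split(),
    "professors attended the annual research symposium".split(),
]
scores = bm25_scores("dr rodriguez".split(), docs)
# Only the first document literally contains the name, so it wins.
print(scores.index(max(scores)))  # -> 0
```

Note that documents without any query term score exactly zero: BM25 has no notion of synonyms, which is precisely the gap the vector retriever fills.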

Why combine them?

Neither retriever is perfect alone:

Query type                              Vector                               BM25
"documents about campus safety"         Good — captures meaning              Decent — matches "safety"
"Dr. Rodriguez"                         Weak — embeds as "person" concept    Strong — matches exact name
"feelings of joy and accomplishment"    Strong — semantic match              Weak — might miss synonyms like "pride"
"October 2020 announcement"             Moderate                             Strong — matches exact date

By retrieving candidates from both and merging them, we get a broader candidate pool that covers both semantic and lexical matches.

Cross-encoder re-ranking

The merged candidates might number 30 to 40 chunks. We don't want to send all of them to the LLM — that wastes context and dilutes quality. A cross-encoder solves this by scoring each (query, chunk) pair directly.

Unlike the bi-encoder embedding model (which encodes query and chunk separately), a cross-encoder reads the query and chunk together and produces a relevance score. This is more accurate but slower — which is why we use it as a second stage on a small candidate set, not on the entire corpus.

We use cross-encoder/ms-marco-MiniLM-L-12-v2 to re-rank the merged candidates down to the top 15 before passing them to the LLM.
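
The merge-deduplicate-re-rank stage can be sketched in a few lines. This is not the actual query_hybrid.py code: the score_fn callable stands in for the cross-encoder (which would score the query and chunk together), and the chunk ids and texts are invented for illustration:

```python
def hybrid_rerank(vector_hits, bm25_hits, score_fn, top_n=15):
    """Merge two candidate lists, deduplicate by chunk id, re-rank, keep top_n.

    vector_hits / bm25_hits: lists of (chunk_id, text) tuples.
    score_fn(text) -> float stands in for the cross-encoder's relevance score.
    """
    merged = {}
    for source, hits in (("vector", vector_hits), ("bm25", bm25_hits)):
        for chunk_id, text in hits:
            if chunk_id in merged:
                merged[chunk_id]["sources"].add(source)   # nominated by both retrievers
            else:
                merged[chunk_id] = {"text": text, "sources": {source}}
    # Score every unique candidate once, then keep the best top_n.
    ranked = sorted(merged.items(),
                    key=lambda kv: score_fn(kv[1]["text"]), reverse=True)
    return ranked[:top_n]

vector_hits = [("a", "campus safety report"), ("b", "autumn festival recap")]
bm25_hits = [("a", "campus safety report"), ("c", "dr rodriguez ceremony speech")]
# Toy scorer: query-word overlap. A real cross-encoder reads query+chunk jointly.
top = hybrid_rerank(vector_hits, bm25_hits,
                    lambda t: len(set(t.split()) & {"campus", "safety"}), top_n=2)
print([cid for cid, _ in top])  # -> ['a', 'b']
```

The sources set carried through the merge is what later lets retrieve.py label each result vector-only, bm25-only, or vector+bm25.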

2. Setup

Prerequisites

Everything from Part III, plus a few additional packages. Each section has its own requirements.txt.

If you are using uv for the workshop (recommended):

cd /path/to/llm-workshop
uv add $(cat 04-semantic-search/requirements.txt)

With a plain venv:

pip install -r requirements.txt

The new packages over Part III are llama-index-retrievers-bm25 (BM25 keyword retrieval) and nltk (used by search_keywords.py for part-of-speech tagging).

Cache the models

Required. This section uses two models: the embedding model from Part III (cached if you ran cache_model.py already) and a cross-encoder for re-ranking. Both must be cached before build_store.py and query_hybrid.py will run, since the scripts run in offline mode.

If you have already cached the embedding model in Part III, point this section at it:

cd 04-semantic-search
ln -s ../03-rag/models models

Then pre-cache the cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2, ~130 MB) by running a similar one-shot script (or temporarily set HF_HUB_OFFLINE=0 in your shell for the first run of build_store.py).
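
A one-shot caching script in the spirit of Part III's cache_model.py might look like the following. The script name is hypothetical, and using sentence-transformers' CrossEncoder class to trigger the download is an assumption; adapt it to however cache_model.py is structured:

```python
# cache_cross_encoder.py (hypothetical name) -- run once, online, from 04-semantic-search
import os
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
# Note: HF_HUB_OFFLINE must NOT be set to "1" for this first, online run.

from sentence_transformers import CrossEncoder

# Instantiating the model downloads and caches it (~130 MB) under ./models.
CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
print("Cross-encoder cached under ./models")
```

After this completes, build_store.py and query_hybrid.py can run fully offline.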

If you have not cached the embedding model yet, run python ../03-rag/cache_model.py first.

Pull the LLM

Make sure ollama is running and command-r7b is available:

ollama pull command-r7b

Libraries and environment variables

This section uses the same three-layer architecture introduced in Part III (LlamaIndex for orchestration, Hugging Face for the embedding and cross-encoder models, and Ollama for response generation), plus one new piece: llama-index-retrievers-bm25 for keyword-based retrieval. BM25 is a classical, non-neural algorithm that complements the neural embedding model.

The same Settings-based configuration applies, and the same environment-variable pattern is used at the top of every script:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

These come before any LlamaIndex or Hugging Face imports because the libraries read the environment at import time. HF_HUB_OFFLINE=1 prevents the "sending unauthenticated requests to the HF Hub" warning and makes runs deterministic. Remove it temporarily if you want to download a fresh model. See Part III, section 2 for the full explanation.

3. Building the vector store

A larger corpus

The hybrid retrieval and re-ranking pipeline only earns its keep on a corpus large enough that retrieval is genuinely selective. The 10 emails from Part III are too few -- vector and BM25 will return almost the same chunks every time. For this section, we recommend a 100-abstract arXiv corpus.

If you did Exercise 8 in Part III, you already have one. Otherwise, run:

python ../03-rag/fetch_arxiv.py --category cs.LG --max 100 --output data

This populates ./data with 100 recent papers from cs.LG (machine learning). Other relevant categories:

  • physics.chem-ph -- chemical physics
  • cond-mat.soft -- soft matter
  • physics.flu-dyn -- fluid dynamics
  • cs.AI -- artificial intelligence

You can also drop in your own collection: NIST data sheets, CCPS process safety case studies (https://www.aiche.org/ccps/resources), US Chemical Safety Board incident reports (https://www.csb.gov/investigations/), or any other text-format documents. PDFs work too -- SimpleDirectoryReader reads them automatically when llama-index-readers-file is installed.

Building the index

The build_store.py script works like the one in Part III, with a few differences:

  • Smaller chunks: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap
  • Incremental updates: by default, it only re-indexes new or modified files
  • Full rebuild: use --rebuild to start from scratch

To rebuild from scratch:

python build_store.py --rebuild

Or for incremental updates after adding new files:

python build_store.py
Mode: incremental update
Loading existing index from ./store...
Index contains 42 documents
Data directory contains 44 files

  New:       2
  Modified:  0
  Deleted:   0
  Unchanged: 42

Indexing 2 file(s)...
Index updated and saved to ./store

Why smaller chunks?

In Part III we used 500-token chunks. Here we use 256. Smaller chunks are more precise — each one represents a more focused piece of text. With a re-ranker to sort them, precision matters more than capturing broad context in a single chunk. The tradeoff: you get more chunks to search through, and each one has less surrounding context.
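
The chunk-count tradeoff is easy to see with a toy sliding-window chunker. This is a rough sketch of the idea, not LlamaIndex's actual splitter (which works on sentences and real tokenizer tokens):

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into overlapping chunks.

    Each chunk starts (chunk_size - overlap) tokens after the previous one,
    so consecutive chunks share `overlap` tokens of context.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # last window reached the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
for size in (128, 256, 512):
    print(size, len(chunk_tokens(tokens, chunk_size=size, overlap=25)))
```

Halving the chunk size roughly doubles the number of chunks: more candidates for the retrievers to choose among, each with less surrounding context.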

Exercise 1: Rebuild the store with different chunk sizes (128, 256, 512, 1024). How does the number of chunks change? How does it affect retrieval quality?

4. Querying with hybrid retrieval

The query_hybrid.py script implements the full hybrid pipeline:

python query_hybrid.py "Find documents about campus safety"

The output shows retrieval statistics before the LLM response:

Query: Find documents about campus safety
Vector: 20, BM25: 20, overlap: 8, merged: 32, re-ranked to: 15

Response:
...

This tells you:

  • 20 candidates came from vector similarity
  • 20 came from BM25
  • 8 were found by both (overlap)
  • 32 unique candidates after merging
  • Re-ranked down to 15 for the LLM

Exercise 2: Run the same query using Part III's query.py (pure vector retrieval) and this tutorial's query_hybrid.py. Compare the source documents listed. Did hybrid retrieval find anything that pure vector missed?

5. Raw retrieval without an LLM

Sometimes you want to see exactly what the retrieval pipeline found, without the LLM summarizing or rephrasing. The retrieve.py script runs the same hybrid retrieval and re-ranking, but outputs the raw chunk text instead of passing it to an LLM:

python retrieve.py "Dr. Rodriguez"
Query: Dr. Rodriguez
Vector: 20, BM25: 20, overlap: 3, merged: 37, re-ranked to: 15
  vector-only: 17, bm25-only: 17, both: 3

================================================================================
=== [1] 2024_08_26_100859.txt  (score: 0.847)  [bm25-only]
================================================================================
Dr. Rodriguez spoke at the opening ceremony, emphasizing the
university's commitment to inclusive excellence...

================================================================================
=== [2] 2023_10_12_155349.txt  (score: 0.712)  [vector+bm25]
================================================================================
...

Each chunk is annotated with its source: vector-only, bm25-only, or vector+bm25. This lets you see which retriever nominated each result.
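
The statistics line is just set arithmetic over the two candidate id lists. A small sketch (the ids below are stand-ins chosen to reproduce the sample run's numbers):

```python
def retrieval_stats(vector_ids, bm25_ids):
    """Classify candidate ids by which retriever nominated them."""
    v, b = set(vector_ids), set(bm25_ids)
    return {
        "vector": len(v), "bm25": len(b),
        "overlap": len(v & b), "merged": len(v | b),
        "vector_only": len(v - b), "bm25_only": len(b - v),
    }

# Two lists of 20 ids sharing exactly three members (17, 18, 19).
stats = retrieval_stats(range(20), range(17, 37))
print(stats)
# -> {'vector': 20, 'bm25': 20, 'overlap': 3, 'merged': 37,
#     'vector_only': 17, 'bm25_only': 17}
```

Note that merged = vector + bm25 - overlap, which is why the sample run shows 37 unique candidates rather than 40.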

This is invaluable for debugging. If your LLM response seems off, check the raw retrieval first — the problem is often in what was retrieved, not how the LLM synthesized it.

Exercise 3: Run retrieve.py with a query that includes a specific name or date. How many of the top results are bm25-only? What would have been missed with pure vector retrieval?

For a complementary approach, search_keywords.py does pure keyword matching with no embeddings at all. It uses NLTK part-of-speech tagging to extract meaningful terms from your query, then searches the raw text files with regex:

python search_keywords.py "Hurricane Sandy recovery efforts"
Query: Hurricane Sandy recovery efforts
Extracted terms: hurricane sandy, recovery, efforts

Found 12 matches across 3 files

============================================================
--- 2012_11_02_164248.txt  (5 matches) ---
============================================================
  >>> 12: Hurricane Sandy has caused significant damage to our campus...
  ...

This is a fallback when you know exactly what you're looking for and don't need semantic matching. It's also fast — no models, no vector store needed.
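
The keyword-search idea can be sketched without NLTK at all. Here a stopword filter crudely stands in for search_keywords.py's part-of-speech tagging, and an in-memory dict stands in for the data directory; the file names and texts are invented:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "about", "find"}

def extract_terms(query):
    """Keep content words. A crude stand-in for POS tagging, which would
    keep nouns and adjectives instead of filtering a stopword list."""
    return [w for w in query.lower().split() if w not in STOPWORDS]

def keyword_search(terms, files):
    """files: dict of name -> text. Returns {name: match_count} for files
    with at least one case-insensitive whole-word match."""
    results = {}
    for name, text in files.items():
        count = sum(len(re.findall(rf"\b{re.escape(t)}\b", text, re.IGNORECASE))
                    for t in terms)
        if count:
            results[name] = count
    return results

files = {
    "2012_11_02.txt": "Hurricane Sandy has caused significant damage. Recovery is underway.",
    "2013_04_10.txt": "The spring career fair drew record attendance.",
}
terms = extract_terms("Hurricane Sandy recovery efforts")
print(keyword_search(terms, files))  # -> {'2012_11_02.txt': 3}
```

Like BM25, this finds only literal matches ("efforts" contributes nothing here because the word never appears), which is exactly when you would reach for retrieve.py instead.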

Exercise 4: Compare the results of search_keywords.py, retrieve.py, and query_hybrid.py on the same query. When is each approach most useful?

6. Comparing the three query modes

Script              Method                                   Uses LLM?   Best for
query_hybrid.py     Hybrid (vector + BM25) + re-rank + LLM   Yes         Synthesized answers from documents
retrieve.py         Hybrid (vector + BM25) + re-rank         No          Inspecting raw retrieval results
search_keywords.py  POS-tagged keyword matching              No          Finding exact names, dates, terms

7. Exercises

Exercise 5: The hybrid retrieval uses VECTOR_TOP_K=20 and BM25_TOP_K=20. Experiment with different values. What happens if you set BM25 to 0 (effectively disabling it)? What about setting vector to 0?

Exercise 6: Change the re-ranker's RERANK_TOP_N from 15 to 5. How does this affect response quality? What about 30?

Exercise 7: Modify the prompt in query_hybrid.py. Try asking the model to respond as a specific persona, or to format the output differently (e.g., as a timeline, or as bullet points).

Exercise 8: Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?

Exercise 9: The cross-encoder we use (cross-encoder/ms-marco-MiniLM-L-12-v2) and the embedding model (BAAI/bge-large-en-v1.5) date from 2022-2024. Newer models are likely available. Browse the MTEB leaderboard for current top embedding models and re-rankers. Swap one in (you will need a full rebuild if you change the embedding model). Does retrieval quality improve? At what cost in model size and speed? This is a recurring task in production systems — models age, and the right answer in 2024 is not the right answer today.

Additional resources and references


Models used in this tutorial

Model                                   Type                  Role                      Source
command-r7b                             LLM (RAG-optimized)   Response generation       ollama pull command-r7b
BAAI/bge-large-en-v1.5                  Embedding (1024-dim)  Text -> vector encoding   Hugging Face (pre-cached; see Setup)
cross-encoder/ms-marco-MiniLM-L-12-v2   Cross-encoder         Re-ranking candidates     Hugging Face (pre-cached; see Setup)

Further reading

  • Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 4 (April 2009), 333-389. https://doi.org/10.1561/1500000019 — the theoretical basis for BM25.
  • Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085 — cross-encoder re-ranking applied to information retrieval; the approach we use in query_hybrid.py.
  • Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 3982-3992. https://arxiv.org/abs/1908.10084 — the foundational paper for sentence-transformers, the library behind both our embedding and cross-encoder models.