
Large Language Models Part IV: Advanced Retrieval and Semantic Search

CHEG 667-013 — Chemical Engineering with Computers
Department of Chemical and Biomolecular Engineering, University of Delaware


Key idea

Build a more effective search system by combining multiple retrieval strategies and re-ranking results.

Key goals

  • Understand why simple vector search sometimes misses relevant results
  • Combine vector similarity with keyword matching (hybrid retrieval)
  • Use a cross-encoder to re-rank candidates
  • Compare LLM-synthesized answers with raw chunk retrieval

This is an advanced topic that builds on Part III (RAG). Make sure you are comfortable with building a vector store and querying it before proceeding.

In Part III, we built a RAG system that embedded documents, retrieved the most similar chunks, and passed them to an LLM. That pipeline works well for many queries — but it has blind spots.

Consider searching for a specific person's name, a date, or a technical term. Vector embeddings capture meaning, not exact strings. A query for "Dr. Rodriguez" might retrieve chunks about "faculty" or "professors" instead of chunks that literally contain the name. Similarly, a query about "October 2020" might return chunks about autumn events in general.

This tutorial introduces three improvements:

  1. Hybrid retrieval — combine vector similarity (good at meaning) with BM25 keyword matching (good at exact terms)
  2. Cross-encoder re-ranking — use a second model to score each (query, chunk) pair more carefully
  3. Raw retrieval mode — inspect what the pipeline retrieves before the LLM sees it

The result is a more effective search system that catches both semantic matches and exact-term matches.

1. How hybrid retrieval works

In Part III, our pipeline was:

Query → Embed → Vector similarity (top 15) → LLM → Response

The improved pipeline is:

Query → Embed ──→ Vector similarity (top 20) ──┐
                                                ├─→ Merge & deduplicate → Cross-encoder re-rank (top 15) → LLM → Response
Query → Tokenize → BM25 term matching (top 20) ┘

Vector retrieval (dense)

This is what we used in Part III. The query is embedded into a vector, and the most similar chunk vectors are returned. This catches semantic matches — chunks with similar meaning, even if the words are different.

BM25 retrieval (sparse)

BM25 is a classical information retrieval algorithm based on term frequency. It scores documents by how often the query's words appear, adjusted for document length. It's fast, requires no embeddings, and excels at finding exact names, dates, and technical terms that embeddings might miss.
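
To make the scoring concrete, here is a minimal pure-Python sketch of the BM25 formula (the actual pipeline uses the llama-index-retrievers-bm25 package; the function name and toy documents below are illustrative only):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula.

    query: list of query tokens. docs: list of token lists.
    Returns one score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            n_t = sum(1 for d in docs if term in d)            # docs containing term
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # rare terms weigh more
            tf = doc.count(term)                                # term frequency in doc
            # Length normalization: long documents are penalized via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "dr rodriguez spoke at the opening ceremony".split(),
    "the faculty senate met to discuss campus safety".split(),
    "professors attended the annual research symposium".split(),
]
scores = bm25_scores("dr rodriguez".split(), docs)
# Only the first document literally contains the name, so it wins.
print(scores.index(max(scores)))  # -> 0
```

Note that documents without any query term score exactly zero: BM25 has no notion of synonyms, which is precisely the gap the vector retriever fills.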

Why combine them?

Neither retriever is perfect alone:

Query type                              Vector                               BM25
"documents about campus safety"         Good — captures meaning              Decent — matches "safety"
"Dr. Rodriguez"                         Weak — embeds as "person" concept    Strong — matches exact name
"feelings of joy and accomplishment"    Strong — semantic match              Weak — might miss synonyms like "pride"
"October 2020 announcement"             Moderate                             Strong — matches exact date

By retrieving candidates from both and merging them, we get a broader candidate pool that covers both semantic and lexical matches.

Cross-encoder re-ranking

The merged candidates might number 30 to 40 chunks. We don't want to send all of them to the LLM — that wastes context and dilutes quality. A cross-encoder solves this by scoring each (query, chunk) pair directly.

Unlike the bi-encoder embedding model (which encodes query and chunk separately), a cross-encoder reads the query and chunk together and produces a relevance score. This is more accurate but slower — which is why we use it as a second stage on a small candidate set, not on the entire corpus.

We use cross-encoder/ms-marco-MiniLM-L-12-v2 to re-rank the merged candidates down to the top 15 before passing them to the LLM.
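
The merge-deduplicate-re-rank stage can be sketched in a few lines. This is not the actual query_hybrid.py code: the score_fn callable stands in for the cross-encoder (which would score the query and chunk together), and the chunk ids and texts are invented for illustration:

```python
def hybrid_rerank(vector_hits, bm25_hits, score_fn, top_n=15):
    """Merge two candidate lists, deduplicate by chunk id, re-rank, keep top_n.

    vector_hits / bm25_hits: lists of (chunk_id, text) tuples.
    score_fn(text) -> float stands in for the cross-encoder's relevance score.
    """
    merged = {}
    for source, hits in (("vector", vector_hits), ("bm25", bm25_hits)):
        for chunk_id, text in hits:
            if chunk_id in merged:
                merged[chunk_id]["sources"].add(source)   # nominated by both retrievers
            else:
                merged[chunk_id] = {"text": text, "sources": {source}}
    # Score every unique candidate once, then keep the best top_n.
    ranked = sorted(merged.items(),
                    key=lambda kv: score_fn(kv[1]["text"]), reverse=True)
    return ranked[:top_n]

vector_hits = [("a", "campus safety report"), ("b", "autumn festival recap")]
bm25_hits = [("a", "campus safety report"), ("c", "dr rodriguez ceremony speech")]
# Toy scorer: query-word overlap. A real cross-encoder reads query+chunk jointly.
top = hybrid_rerank(vector_hits, bm25_hits,
                    lambda t: len(set(t.split()) & {"campus", "safety"}), top_n=2)
print([cid for cid, _ in top])  # -> ['a', 'b']
```

The sources set carried through the merge is what later lets retrieve.py label each result vector-only, bm25-only, or vector+bm25.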

2. Setup

Prerequisites

Everything from Part III, plus a few additional packages. Each section has its own requirements.txt.

If you are using uv for the workshop (recommended):

cd /path/to/llm-workshop
uv add $(cat 04-semantic-search/requirements.txt)

With a plain venv:

pip install -r requirements.txt

The new packages over Part III are llama-index-retrievers-bm25 (BM25 keyword retrieval) and nltk (used by search_keywords.py for part-of-speech tagging).

Cache the models

Required. This section uses two models: the embedding model from Part III (cached if you ran cache_model.py already) and a cross-encoder for re-ranking. Both must be cached before build_store.py and query_hybrid.py will run, since the scripts run in offline mode.

If you have already cached the embedding model in Part III, point this section at it:

cd 04-semantic-search
ln -s ../03-rag/models models

Then pre-cache the cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2, ~130 MB) by running a similar one-shot script (or temporarily set HF_HUB_OFFLINE=0 in your shell for the first run of build_store.py).
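
A one-shot caching script in the spirit of Part III's cache_model.py might look like the following. The script name is hypothetical, and using sentence-transformers' CrossEncoder class to trigger the download is an assumption; adapt it to however cache_model.py is structured:

```python
# cache_cross_encoder.py (hypothetical name) -- run once, online, from 04-semantic-search
import os
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
# Note: HF_HUB_OFFLINE must NOT be set to "1" for this first, online run.

from sentence_transformers import CrossEncoder

# Instantiating the model downloads and caches it (~130 MB) under ./models.
CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
print("Cross-encoder cached under ./models")
```

After this completes, build_store.py and query_hybrid.py can run fully offline.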

If you have not cached the embedding model yet, run python ../03-rag/cache_model.py first.

Pull the LLM

Make sure ollama is running and command-r7b is available:

ollama pull command-r7b

Libraries and environment variables

This section uses the same three-layer architecture introduced in Part III (LlamaIndex for orchestration, Hugging Face for the embedding and cross-encoder models, and Ollama for response generation), plus one new piece: llama-index-retrievers-bm25 for keyword-based retrieval. BM25 is a classical, non-neural algorithm that complements the neural embedding model.

The same Settings-based configuration applies, and the same environment-variable pattern is used at the top of every script:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

These come before any LlamaIndex or Hugging Face imports because the libraries read the environment at import time. HF_HUB_OFFLINE=1 prevents the "sending unauthenticated requests to the HF Hub" warning and makes runs deterministic. Remove it temporarily if you want to download a fresh model. See Part III, section 2 for the full explanation.

3. Building the vector store

A larger corpus

The hybrid retrieval and re-ranking pipeline only earns its keep on a corpus large enough that retrieval is genuinely selective. The 10 emails from Part III are too few -- vector and BM25 will return almost the same chunks every time. For this section, we recommend a 100-abstract arXiv corpus.

If you did Exercise 8 in Part III, you already have one. Otherwise, run:

python ../03-rag/fetch_arxiv.py --category cs.LG --max 100 --output data

This populates ./data with 100 recent papers from cs.LG (machine learning). Other relevant categories:

  • physics.chem-ph -- chemical physics
  • cond-mat.soft -- soft matter
  • physics.flu-dyn -- fluid dynamics
  • cs.AI -- artificial intelligence

You can also drop in your own collection: NIST data sheets, CCPS process safety case studies (https://www.aiche.org/ccps/resources), US Chemical Safety Board incident reports (https://www.csb.gov/investigations/), or any other text-format documents. PDFs work too -- SimpleDirectoryReader reads them automatically when llama-index-readers-file is installed.

Building the index

The build_store.py script works like the one in Part III, with a few differences:

  • Smaller chunks: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap
  • Incremental updates: by default, it only re-indexes new or modified files
  • Full rebuild: use --rebuild to start from scratch

To rebuild from scratch:

python build_store.py --rebuild

Or for incremental updates after adding new files:

python build_store.py
Mode: incremental update
Loading existing index from ./store...
Index contains 42 documents
Data directory contains 44 files

  New:       2
  Modified:  0
  Deleted:   0
  Unchanged: 42

Indexing 2 file(s)...
Index updated and saved to ./store

Why smaller chunks?

In Part III we used 500-token chunks. Here we use 256. Smaller chunks are more precise — each one represents a more focused piece of text. With a re-ranker to sort them, precision matters more than capturing broad context in a single chunk. The tradeoff: you get more chunks to search through, and each one has less surrounding context.
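
The chunk-count tradeoff is easy to see with a toy sliding-window chunker. This is a rough sketch of the idea, not LlamaIndex's actual splitter (which works on sentences and real tokenizer tokens):

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into overlapping chunks.

    Each chunk starts (chunk_size - overlap) tokens after the previous one,
    so consecutive chunks share `overlap` tokens of context.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # last window reached the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
for size in (128, 256, 512):
    print(size, len(chunk_tokens(tokens, chunk_size=size, overlap=25)))
```

Halving the chunk size roughly doubles the number of chunks: more candidates for the retrievers to choose among, each with less surrounding context.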

Exercise 1: Rebuild the store with different chunk sizes (128, 256, 512, 1024). How does the number of chunks change? How does it affect retrieval quality?

4. Querying with hybrid retrieval

The query_hybrid.py script implements the full hybrid pipeline:

python query_hybrid.py "Find documents about campus safety"

The output shows retrieval statistics before the LLM response:

Query: Find documents about campus safety
Vector: 20, BM25: 20, overlap: 8, merged: 32, re-ranked to: 15

Response:
...

This tells you:

  • 20 candidates came from vector similarity
  • 20 came from BM25
  • 8 were found by both (overlap)
  • 32 unique candidates after merging
  • Re-ranked down to 15 for the LLM

Exercise 2: Run the same query using Part III's query.py (pure vector retrieval) and this tutorial's query_hybrid.py. Compare the source documents listed. Did hybrid retrieval find anything that pure vector missed?

5. Raw retrieval without an LLM

Sometimes you want to see exactly what the retrieval pipeline found, without the LLM summarizing or rephrasing. The retrieve.py script runs the same hybrid retrieval and re-ranking, but outputs the raw chunk text instead of passing it to an LLM:

python retrieve.py "Dr. Rodriguez"
Query: Dr. Rodriguez
Vector: 20, BM25: 20, overlap: 3, merged: 37, re-ranked to: 15
  vector-only: 17, bm25-only: 17, both: 3

================================================================================
=== [1] 2024_08_26_100859.txt  (score: 0.847)  [bm25-only]
================================================================================
Dr. Rodriguez spoke at the opening ceremony, emphasizing the
university's commitment to inclusive excellence...

================================================================================
=== [2] 2023_10_12_155349.txt  (score: 0.712)  [vector+bm25]
================================================================================
...

Each chunk is annotated with its source: vector-only, bm25-only, or vector+bm25. This lets you see which retriever nominated each result.
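
The statistics line is just set arithmetic over the two candidate id lists. A small sketch (the ids below are stand-ins chosen to reproduce the sample run's numbers):

```python
def retrieval_stats(vector_ids, bm25_ids):
    """Classify candidate ids by which retriever nominated them."""
    v, b = set(vector_ids), set(bm25_ids)
    return {
        "vector": len(v), "bm25": len(b),
        "overlap": len(v & b), "merged": len(v | b),
        "vector_only": len(v - b), "bm25_only": len(b - v),
    }

# Two lists of 20 ids sharing exactly three members (17, 18, 19).
stats = retrieval_stats(range(20), range(17, 37))
print(stats)
# -> {'vector': 20, 'bm25': 20, 'overlap': 3, 'merged': 37,
#     'vector_only': 17, 'bm25_only': 17}
```

Note that merged = vector + bm25 - overlap, which is why the sample run shows 37 unique candidates rather than 40.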

This is invaluable for debugging. If your LLM response seems off, check the raw retrieval first — the problem is often in what was retrieved, not how the LLM synthesized it.

Exercise 3: Run retrieve.py with a query that includes a specific name or date. How many of the top results are bm25-only? What would have been missed with pure vector retrieval?

For a complementary approach, search_keywords.py does pure keyword matching with no embeddings at all. It uses NLTK part-of-speech tagging to extract meaningful terms from your query, then searches the raw text files with regex:

python search_keywords.py "Hurricane Sandy recovery efforts"
Query: Hurricane Sandy recovery efforts
Extracted terms: hurricane sandy, recovery, efforts

Found 12 matches across 3 files

============================================================
--- 2012_11_02_164248.txt  (5 matches) ---
============================================================
  >>> 12: Hurricane Sandy has caused significant damage to our campus...
  ...

This is a fallback when you know exactly what you're looking for and don't need semantic matching. It's also fast — no models, no vector store needed.
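
The keyword-search idea can be sketched without NLTK at all. Here a stopword filter crudely stands in for search_keywords.py's part-of-speech tagging, and an in-memory dict stands in for the data directory; the file names and texts are invented:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "about", "find"}

def extract_terms(query):
    """Keep content words. A crude stand-in for POS tagging, which would
    keep nouns and adjectives instead of filtering a stopword list."""
    return [w for w in query.lower().split() if w not in STOPWORDS]

def keyword_search(terms, files):
    """files: dict of name -> text. Returns {name: match_count} for files
    with at least one case-insensitive whole-word match."""
    results = {}
    for name, text in files.items():
        count = sum(len(re.findall(rf"\b{re.escape(t)}\b", text, re.IGNORECASE))
                    for t in terms)
        if count:
            results[name] = count
    return results

files = {
    "2012_11_02.txt": "Hurricane Sandy has caused significant damage. Recovery is underway.",
    "2013_04_10.txt": "The spring career fair drew record attendance.",
}
terms = extract_terms("Hurricane Sandy recovery efforts")
print(keyword_search(terms, files))  # -> {'2012_11_02.txt': 3}
```

Like BM25, this finds only literal matches ("efforts" contributes nothing here because the word never appears), which is exactly when you would reach for retrieve.py instead.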

Exercise 4: Compare the results of search_keywords.py, retrieve.py, and query_hybrid.py on the same query. When is each approach most useful?

6. Comparing the three query modes

Script              Method                                   Uses LLM?   Best for
query_hybrid.py     Hybrid (vector + BM25) + re-rank + LLM   Yes         Synthesized answers from documents
retrieve.py         Hybrid (vector + BM25) + re-rank         No          Inspecting raw retrieval results
search_keywords.py  POS-tagged keyword matching              No          Finding exact names, dates, terms

7. Exercises

Exercise 5: The hybrid retrieval uses VECTOR_TOP_K=20 and BM25_TOP_K=20. Experiment with different values. What happens if you set BM25 to 0 (effectively disabling it)? What about setting vector to 0?

Exercise 6: Change the re-ranker's RERANK_TOP_N from 15 to 5. How does this affect response quality? What about 30?

Exercise 7: Modify the prompt in query_hybrid.py. Try asking the model to respond as a specific persona, or to format the output differently (e.g., as a timeline, or as bullet points).

Exercise 8: Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?

Exercise 9: The cross-encoder we use (cross-encoder/ms-marco-MiniLM-L-12-v2) and the embedding model (BAAI/bge-large-en-v1.5) date from 2022-2024. Newer models are likely available. Browse the MTEB leaderboard for current top embedding models and re-rankers. Swap one in (you will need a full rebuild if you change the embedding model). Does retrieval quality improve? At what cost in model size and speed? This is a recurring task in production systems — models age, and the right answer in 2024 is not the right answer today.

Additional resources and references


Models used in this tutorial

Model                                   Type                  Role                      Source
command-r7b                             LLM (RAG-optimized)   Response generation       ollama pull command-r7b
BAAI/bge-large-en-v1.5                  Embedding (1024-dim)  Text -> vector encoding   Hugging Face (pre-cached; see Setup)
cross-encoder/ms-marco-MiniLM-L-12-v2   Cross-encoder         Re-ranking candidates     Hugging Face (pre-cached; see Setup)

Further reading

  • Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 4 (April 2009), 333-389. https://doi.org/10.1561/1500000019 — the theoretical basis for BM25.
  • Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085 — cross-encoder re-ranking applied to information retrieval; the approach we use in query_hybrid.py.
  • Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 3982-3992. https://arxiv.org/abs/1908.10084 — the foundational paper for sentence-transformers, the library behind both our embedding and cross-encoder models.