
Large Language Models Part IV: Advanced Retrieval and Semantic Search

CHEG 667-013 — Chemical Engineering with Computers
Department of Chemical and Biomolecular Engineering, University of Delaware


Key idea

Build a more effective search system by combining multiple retrieval strategies and re-ranking results.

Key goals

  • Understand why simple vector search sometimes misses relevant results
  • Combine vector similarity with keyword matching (hybrid retrieval)
  • Use a cross-encoder to re-rank candidates
  • Compare LLM-synthesized answers with raw chunk retrieval

This is an advanced topic that builds on Part III (RAG). Make sure you are comfortable with building a vector store and querying it before proceeding.

In Part III, we built a RAG system that embedded documents, retrieved the most similar chunks, and passed them to an LLM. That pipeline works well for many queries — but it has blind spots.

Consider searching for a specific person's name, a date, or a technical term. Vector embeddings capture meaning, not exact strings. A query for "Dr. Rodriguez" might retrieve chunks about "faculty" or "professors" instead of chunks that literally contain the name. Similarly, a query about "October 2020" might return chunks about autumn events in general.

This tutorial introduces three improvements:

  1. Hybrid retrieval — combine vector similarity (good at meaning) with BM25 keyword matching (good at exact terms)
  2. Cross-encoder re-ranking — use a second model to score each (query, chunk) pair more carefully
  3. Raw retrieval mode — inspect what the pipeline retrieves before the LLM sees it

The result is a more effective search system that catches both semantic matches and exact-term matches.

1. How hybrid retrieval works

In Part III, our pipeline was:

Query → Embed → Vector similarity (top 15) → LLM → Response

The improved pipeline is:

Query → Embed ──→ Vector similarity (top 20) ──┐
                                                ├─→ Merge & deduplicate → Cross-encoder re-rank (top 15) → LLM → Response
Query → Tokenize → BM25 term matching (top 20) ┘

Vector retrieval (dense)

This is what we used in Part III. The query is embedded into a vector, and the most similar chunk vectors are returned. This catches semantic matches — chunks with similar meaning, even if the words are different.
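
For intuition, dense retrieval boils down to ranking chunks by cosine similarity between vectors. Here is a minimal pure-Python sketch; the real pipeline uses the LlamaIndex vector store with BAAI/bge-large-en-v1.5 embeddings, and the function names below are illustrative, not from the scripts:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def vector_top_k(query_vec, chunk_vecs, k=20):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Two chunks about the same topic end up with nearby vectors even when they share few literal words, which is exactly what the next retriever cannot do.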

BM25 retrieval (sparse)

BM25 is a classical information retrieval algorithm based on term frequency. It scores documents by how often the query's words appear, adjusted for document length. It's fast, requires no embeddings, and excels at finding exact names, dates, and technical terms that embeddings might miss.
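
To see how term frequency, document frequency, and length normalization interact, here is a toy BM25 scorer over pre-tokenized documents. This is a sketch for illustration only; the actual pipeline uses the llama-index-retrievers-bm25 package:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25.

    docs is a list of token lists; returns one score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: how many docs contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            # rarer terms get higher weight (idf); repeated terms saturate (k1);
            # long documents are penalized (b)
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```

A document that never contains the query term scores exactly zero, which is why BM25 is so reliable for names and dates: either the string is there or it isn't.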

Why combine them?

Neither retriever is perfect alone:

| Query type | Vector | BM25 |
|---|---|---|
| "documents about campus safety" | Good — captures meaning | Decent — matches "safety" |
| "Dr. Rodriguez" | Weak — embeds as "person" concept | Strong — matches exact name |
| "feelings of joy and accomplishment" | Strong — semantic match | Weak — might miss synonyms like "pride" |
| "October 2020 announcement" | Moderate | Strong — matches exact date |

By retrieving candidates from both and merging them, we get a broader candidate pool that covers both semantic and lexical matches.
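
The merge-and-deduplicate step can be sketched as follows. This is an illustrative helper, not the actual query_hybrid.py code; chunk ids stand in for retrieved nodes:

```python
def merge_candidates(vector_hits, bm25_hits):
    """Merge two ranked lists of chunk ids, deduplicating and
    tagging each candidate with the retriever(s) that found it."""
    vec_ids, bm25_ids = set(vector_hits), set(bm25_hits)
    merged, seen = [], set()
    for cid in vector_hits + bm25_hits:  # preserve rank order, vector list first
        if cid in seen:
            continue
        seen.add(cid)
        if cid in vec_ids and cid in bm25_ids:
            source = "vector+bm25"
        elif cid in vec_ids:
            source = "vector-only"
        else:
            source = "bm25-only"
        merged.append((cid, source))
    return merged
```

With 20 hits from each retriever and 8 found by both, this yields 20 + 20 − 8 = 32 unique candidates, which is exactly the arithmetic behind the statistics line the scripts print.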

Cross-encoder re-ranking

The merged candidates might number 30–40 chunks. We don't want to send all of them to the LLM — that wastes context and dilutes quality. A cross-encoder solves this by scoring each (query, chunk) pair directly.

Unlike the bi-encoder embedding model (which encodes query and chunk separately), a cross-encoder reads the query and chunk together and produces a relevance score. This is more accurate but slower — which is why we use it as a second stage on a small candidate set, not on the entire corpus.

We use cross-encoder/ms-marco-MiniLM-L-12-v2 to re-rank the merged candidates down to the top 15 before passing them to the LLM.
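
The second-stage logic can be sketched like this. The overlap_score function below is a toy stand-in so the example runs anywhere; in the actual scripts the scoring is done by the cross-encoder model loaded through sentence-transformers:

```python
def rerank(query, chunks, score_fn, top_n=15):
    """Score each (query, chunk) pair and keep the top_n best.

    In the real pipeline, score_fn would wrap the cross-encoder
    (cross-encoder/ms-marco-MiniLM-L-12-v2), which reads query and
    chunk jointly and outputs a relevance score.
    """
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def overlap_score(query, chunk):
    """Toy relevance score: number of words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```

Note the cost asymmetry this structure exploits: the scorer runs once per candidate pair, so it is applied only to the ~30–40 merged candidates, never to the full corpus.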

2. Setup

Prerequisites

Everything from Part III, plus a few additional packages:

pip install llama-index-retrievers-bm25 nltk

A requirements.txt is provided with the full set of dependencies:

pip install -r requirements.txt

The cross-encoder model (cross-encoder/ms-marco-MiniLM-L-12-v2) will download automatically on first use via sentence-transformers. It is small (~130 MB).

Make sure ollama is running and command-r7b is available:

ollama pull command-r7b

3. Building the vector store

The build_store.py script works like the one in Part III, with a few differences:

  • Smaller chunks: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap
  • Incremental updates: by default, it only re-indexes new or modified files
  • Full rebuild: use --rebuild to start from scratch

python build_store.py --rebuild

Or for incremental updates after adding new files:

python build_store.py
Mode: incremental update
Loading existing index from ./store...
Index contains 42 documents
Data directory contains 44 files

  New:       2
  Modified:  0
  Deleted:   0
  Unchanged: 42

Indexing 2 file(s)...
Index updated and saved to ./store

Why smaller chunks?

In Part III we used 500-token chunks. Here we use 256. Smaller chunks are more precise — each one represents a more focused piece of text. With a re-ranker to sort them, precision matters more than capturing broad context in a single chunk. The tradeoff: you get more chunks to search through, and each one has less surrounding context.
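
The size/overlap tradeoff is easy to see in a sketch. This token-level chunker is illustrative only; build_store.py relies on LlamaIndex's splitter, which also respects sentence boundaries:

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into overlapping chunks.

    Each chunk starts (chunk_size - overlap) tokens after the previous
    one, so consecutive chunks share `overlap` tokens of context.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Halving chunk_size roughly doubles the number of chunks to index and search, which is the precision-for-context trade described above.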

Exercise 1: Rebuild the store with different chunk sizes (128, 256, 512, 1024). How does the number of chunks change? How does it affect retrieval quality?

4. Querying with hybrid retrieval

The query_hybrid.py script implements the full hybrid pipeline:

python query_hybrid.py "Find documents about campus safety"

The output shows retrieval statistics before the LLM response:

Query: Find documents about campus safety
Vector: 20, BM25: 20, overlap: 8, merged: 32, re-ranked to: 15

Response:
...

This tells you:

  • 20 candidates came from vector similarity
  • 20 came from BM25
  • 8 were found by both (overlap)
  • 32 unique candidates after merging
  • Re-ranked down to 15 for the LLM

Exercise 2: Run the same query using Part III's query.py (pure vector retrieval) and this tutorial's query_hybrid.py. Compare the source documents listed. Did hybrid retrieval find anything that pure vector missed?

5. Raw retrieval without an LLM

Sometimes you want to see exactly what the retrieval pipeline found, without the LLM summarizing or rephrasing. The retrieve.py script runs the same hybrid retrieval and re-ranking, but outputs the raw chunk text instead of passing it to an LLM:

python retrieve.py "Dr. Rodriguez"
Query: Dr. Rodriguez
Vector: 20, BM25: 20, overlap: 3, merged: 37, re-ranked to: 15
  vector-only: 17, bm25-only: 17, both: 3

================================================================================
=== [1] 2024_08_26_100859.txt  (score: 0.847)  [bm25-only]
================================================================================
Dr. Rodriguez spoke at the opening ceremony, emphasizing the
university's commitment to inclusive excellence...

================================================================================
=== [2] 2023_10_12_155349.txt  (score: 0.712)  [vector+bm25]
================================================================================
...

Each chunk is annotated with its source: vector-only, bm25-only, or vector+bm25. This lets you see which retriever nominated each result.

This is invaluable for debugging. If your LLM response seems off, check the raw retrieval first — the problem is often in what was retrieved, not how the LLM synthesized it.

Exercise 3: Run retrieve.py with a query that includes a specific name or date. How many of the top results are bm25-only? What would have been missed with pure vector retrieval?

6. Keyword search without embeddings

For a complementary approach, search_keywords.py does pure keyword matching with no embeddings at all. It uses NLTK part-of-speech tagging to extract meaningful terms from your query, then searches the raw text files with regex:

python search_keywords.py "Hurricane Sandy recovery efforts"
Query: Hurricane Sandy recovery efforts
Extracted terms: hurricane sandy, recovery, efforts

Found 12 matches across 3 files

============================================================
--- 2012_11_02_164248.txt  (5 matches) ---
============================================================
  >>> 12: Hurricane Sandy has caused significant damage to our campus...
  ...

This is a fallback when you know exactly what you're looking for and don't need semantic matching. It's also fast — no models, no vector store needed.

Exercise 4: Compare the results of search_keywords.py, retrieve.py, and query_hybrid.py on the same query. When is each approach most useful?

7. Comparing the three query modes

| Script | Method | Uses LLM? | Best for |
|---|---|---|---|
| query_hybrid.py | Hybrid (vector + BM25) + re-rank + LLM | Yes | Synthesized answers from documents |
| retrieve.py | Hybrid (vector + BM25) + re-rank | No | Inspecting raw retrieval results |
| search_keywords.py | POS-tagged keyword matching | No | Finding exact names, dates, terms |

8. Exercises

Exercise 5: The hybrid retrieval uses VECTOR_TOP_K=20 and BM25_TOP_K=20. Experiment with different values. What happens if you set BM25 to 0 (effectively disabling it)? What about setting vector to 0?

Exercise 6: Change the re-ranker's RERANK_TOP_N from 15 to 5. How does this affect response quality? What about 30?

Exercise 7: Modify the prompt in query_hybrid.py. Try asking the model to respond as a specific persona, or to format the output differently (e.g., as a timeline, or as bullet points).

Exercise 8: Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?

Additional resources and references


Models used in this tutorial

| Model | Type | Role | Source |
|---|---|---|---|
| command-r7b | LLM (RAG-optimized) | Response generation | ollama pull command-r7b |
| BAAI/bge-large-en-v1.5 | Embedding (1024-dim) | Text → vector encoding | Hugging Face (auto-downloaded) |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | Cross-encoder | Re-ranking candidates | Hugging Face (auto-downloaded) |

Further reading

  • Robertson & Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond (2009) — the theory behind BM25
  • Nogueira & Cho, Passage Re-ranking with BERT (2019) — cross-encoder re-ranking applied to information retrieval