
Large Language Models Part IV: Advanced Retrieval and Semantic Search

CHEG 667-013 — Chemical Engineering with Computers
Department of Chemical and Biomolecular Engineering, University of Delaware


Key idea

Build a more effective search system by combining multiple retrieval strategies and re-ranking results.

Key goals

  • Understand why simple vector search sometimes misses relevant results
  • Combine vector similarity with keyword matching (hybrid retrieval)
  • Use a cross-encoder to re-rank candidates
  • Compare LLM-synthesized answers with raw chunk retrieval

This is an advanced topic that builds on Part III (RAG). Make sure you are comfortable with building a vector store and querying it before proceeding.

In Part III, we built a RAG system that embedded documents, retrieved the most similar chunks, and passed them to an LLM. That pipeline works well for many queries — but it has blind spots.

Consider searching for a specific person's name, a date, or a technical term. Vector embeddings capture meaning, not exact strings. A query for "Dr. Rodriguez" might retrieve chunks about "faculty" or "professors" instead of chunks that literally contain the name. Similarly, a query about "October 2020" might return chunks about autumn events in general.

This tutorial introduces three improvements:

  1. Hybrid retrieval — combine vector similarity (good at meaning) with BM25 keyword matching (good at exact terms)
  2. Cross-encoder re-ranking — use a second model to score each (query, chunk) pair more carefully
  3. Raw retrieval mode — inspect what the pipeline retrieves before the LLM sees it

The result is a more effective search system that catches both semantic matches and exact-term matches.

1. How hybrid retrieval works

In Part III, our pipeline was:

Query → Embed → Vector similarity (top 15) → LLM → Response

The improved pipeline is:

Query → Embed ──→ Vector similarity (top 20) ──┐
                                                ├─→ Merge & deduplicate → Cross-encoder re-rank (top 15) → LLM → Response
Query → Tokenize → BM25 term matching (top 20) ┘

Vector retrieval (dense)

This is what we used in Part III. The query is embedded into a vector, and the most similar chunk vectors are returned. This catches semantic matches — chunks with similar meaning, even if the words are different.
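
For intuition, dense retrieval boils down to ranking chunks by cosine similarity between vectors. Here is a minimal pure-Python sketch; the real pipeline uses the LlamaIndex vector store with BAAI/bge-large-en-v1.5 embeddings, and the function names below are illustrative, not from the scripts:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def vector_top_k(query_vec, chunk_vecs, k=20):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Two chunks about the same topic end up with nearby vectors even when they share few literal words, which is exactly what the next retriever cannot do.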

BM25 retrieval (sparse)

BM25 is a classical information retrieval algorithm based on term frequency. It scores documents by how often the query's words appear, adjusted for document length. It's fast, requires no embeddings, and excels at finding exact names, dates, and technical terms that embeddings might miss.
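
To see how term frequency, document frequency, and length normalization interact, here is a toy BM25 scorer over pre-tokenized documents. This is a sketch for illustration only; the actual pipeline uses the llama-index-retrievers-bm25 package:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25.

    docs is a list of token lists; returns one score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: how many docs contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            # rarer terms get higher weight (idf); repeated terms saturate (k1);
            # long documents are penalized (b)
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```

A document that never contains the query term scores exactly zero, which is why BM25 is so reliable for names and dates: either the string is there or it isn't.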

Why combine them?

Neither retriever is perfect alone:

| Query type | Vector | BM25 |
|---|---|---|
| "documents about campus safety" | Good — captures meaning | Decent — matches "safety" |
| "Dr. Rodriguez" | Weak — embeds as "person" concept | Strong — matches exact name |
| "feelings of joy and accomplishment" | Strong — semantic match | Weak — might miss synonyms like "pride" |
| "October 2020 announcement" | Moderate | Strong — matches exact date |

By retrieving candidates from both and merging them, we get a broader candidate pool that covers both semantic and lexical matches.
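
The merge-and-deduplicate step can be sketched as follows. This is an illustrative helper, not the actual query_hybrid.py code; chunk ids stand in for retrieved nodes:

```python
def merge_candidates(vector_hits, bm25_hits):
    """Merge two ranked lists of chunk ids, deduplicating and
    tagging each candidate with the retriever(s) that found it."""
    vec_ids, bm25_ids = set(vector_hits), set(bm25_hits)
    merged, seen = [], set()
    for cid in vector_hits + bm25_hits:  # preserve rank order, vector list first
        if cid in seen:
            continue
        seen.add(cid)
        if cid in vec_ids and cid in bm25_ids:
            source = "vector+bm25"
        elif cid in vec_ids:
            source = "vector-only"
        else:
            source = "bm25-only"
        merged.append((cid, source))
    return merged
```

With 20 hits from each retriever and 8 found by both, this yields 20 + 20 − 8 = 32 unique candidates, which is exactly the arithmetic behind the statistics line the scripts print.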

Cross-encoder re-ranking

The merged candidates might number 30–40 chunks. We don't want to send all of them to the LLM — that wastes context and dilutes quality. A cross-encoder solves this by scoring each (query, chunk) pair directly.

Unlike the bi-encoder embedding model (which encodes query and chunk separately), a cross-encoder reads the query and chunk together and produces a relevance score. This is more accurate but slower — which is why we use it as a second stage on a small candidate set, not on the entire corpus.

We use cross-encoder/ms-marco-MiniLM-L-12-v2 to re-rank the merged candidates down to the top 15 before passing them to the LLM.
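
The second-stage logic can be sketched like this. The overlap_score function below is a toy stand-in so the example runs anywhere; in the actual scripts the scoring is done by the cross-encoder model loaded through sentence-transformers:

```python
def rerank(query, chunks, score_fn, top_n=15):
    """Score each (query, chunk) pair and keep the top_n best.

    In the real pipeline, score_fn would wrap the cross-encoder
    (cross-encoder/ms-marco-MiniLM-L-12-v2), which reads query and
    chunk jointly and outputs a relevance score.
    """
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def overlap_score(query, chunk):
    """Toy relevance score: number of words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))
```

Note the cost asymmetry this structure exploits: the scorer runs once per candidate pair, so it is applied only to the ~30–40 merged candidates, never to the full corpus.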

2. Setup

Prerequisites

Everything from Part III, plus a few additional packages:

pip install llama-index-retrievers-bm25 nltk

A requirements.txt is provided with the full set of dependencies:

pip install -r requirements.txt

The cross-encoder model (cross-encoder/ms-marco-MiniLM-L-12-v2) will download automatically on first use via sentence-transformers. It is small (~130 MB).

Make sure ollama is running and command-r7b is available:

ollama pull command-r7b

3. Building the vector store

The build_store.py script works like the one in Part III, with a few differences:

  • Smaller chunks: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap
  • Incremental updates: by default, it only re-indexes new or modified files
  • Full rebuild: use --rebuild to start from scratch

python build_store.py --rebuild

Or for incremental updates after adding new files:

python build_store.py
Mode: incremental update
Loading existing index from ./store...
Index contains 42 documents
Data directory contains 44 files

  New:       2
  Modified:  0
  Deleted:   0
  Unchanged: 42

Indexing 2 file(s)...
Index updated and saved to ./store

Why smaller chunks?

In Part III we used 500-token chunks. Here we use 256. Smaller chunks are more precise — each one represents a more focused piece of text. With a re-ranker to sort them, precision matters more than capturing broad context in a single chunk. The tradeoff: you get more chunks to search through, and each one has less surrounding context.
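
The size/overlap tradeoff is easy to see in a sketch. This token-level chunker is illustrative only; build_store.py relies on LlamaIndex's splitter, which also respects sentence boundaries:

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into overlapping chunks.

    Each chunk starts (chunk_size - overlap) tokens after the previous
    one, so consecutive chunks share `overlap` tokens of context.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Halving chunk_size roughly doubles the number of chunks to index and search, which is the precision-for-context trade described above.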

Exercise 1: Rebuild the store with different chunk sizes (128, 256, 512, 1024). How does the number of chunks change? How does it affect retrieval quality?

4. Querying with hybrid retrieval

The query_hybrid.py script implements the full hybrid pipeline:

python query_hybrid.py "Find documents about campus safety"

The output shows retrieval statistics before the LLM response:

Query: Find documents about campus safety
Vector: 20, BM25: 20, overlap: 8, merged: 32, re-ranked to: 15

Response:
...

This tells you:

  • 20 candidates came from vector similarity
  • 20 came from BM25
  • 8 were found by both (overlap)
  • 32 unique candidates after merging
  • Re-ranked down to 15 for the LLM

Exercise 2: Run the same query using Part III's query.py (pure vector retrieval) and this tutorial's query_hybrid.py. Compare the source documents listed. Did hybrid retrieval find anything that pure vector missed?

5. Raw retrieval without an LLM

Sometimes you want to see exactly what the retrieval pipeline found, without the LLM summarizing or rephrasing. The retrieve.py script runs the same hybrid retrieval and re-ranking, but outputs the raw chunk text instead of passing it to an LLM:

python retrieve.py "Dr. Rodriguez"
Query: Dr. Rodriguez
Vector: 20, BM25: 20, overlap: 3, merged: 37, re-ranked to: 15
  vector-only: 17, bm25-only: 17, both: 3

================================================================================
=== [1] 2024_08_26_100859.txt  (score: 0.847)  [bm25-only]
================================================================================
Dr. Rodriguez spoke at the opening ceremony, emphasizing the
university's commitment to inclusive excellence...

================================================================================
=== [2] 2023_10_12_155349.txt  (score: 0.712)  [vector+bm25]
================================================================================
...

Each chunk is annotated with its source: vector-only, bm25-only, or vector+bm25. This lets you see which retriever nominated each result.

This is invaluable for debugging. If your LLM response seems off, check the raw retrieval first — the problem is often in what was retrieved, not how the LLM synthesized it.

Exercise 3: Run retrieve.py with a query that includes a specific name or date. How many of the top results are bm25-only? What would have been missed with pure vector retrieval?

6. Keyword search without embeddings

For a complementary approach, search_keywords.py does pure keyword matching with no embeddings at all. It uses NLTK part-of-speech tagging to extract meaningful terms from your query, then searches the raw text files with regex:

python search_keywords.py "Hurricane Sandy recovery efforts"
Query: Hurricane Sandy recovery efforts
Extracted terms: hurricane sandy, recovery, efforts

Found 12 matches across 3 files

============================================================
--- 2012_11_02_164248.txt  (5 matches) ---
============================================================
  >>> 12: Hurricane Sandy has caused significant damage to our campus...
  ...

This is a fallback when you know exactly what you're looking for and don't need semantic matching. It's also fast — no models, no vector store needed.

Exercise 4: Compare the results of search_keywords.py, retrieve.py, and query_hybrid.py on the same query. When is each approach most useful?

7. Comparing the three query modes

| Script | Method | Uses LLM? | Best for |
|---|---|---|---|
| query_hybrid.py | Hybrid (vector + BM25) + re-rank + LLM | Yes | Synthesized answers from documents |
| retrieve.py | Hybrid (vector + BM25) + re-rank | No | Inspecting raw retrieval results |
| search_keywords.py | POS-tagged keyword matching | No | Finding exact names, dates, terms |

8. Exercises

Exercise 5: The hybrid retrieval uses VECTOR_TOP_K=20 and BM25_TOP_K=20. Experiment with different values. What happens if you set BM25 to 0 (effectively disabling it)? What about setting vector to 0?

Exercise 6: Change the re-ranker's RERANK_TOP_N from 15 to 5. How does this affect response quality? What about 30?

Exercise 7: Modify the prompt in query_hybrid.py. Try asking the model to respond as a specific persona, or to format the output differently (e.g., as a timeline, or as bullet points).

Exercise 8: Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?

Additional resources and references


Models used in this tutorial

| Model | Type | Role | Source |
|---|---|---|---|
| command-r7b | LLM (RAG-optimized) | Response generation | ollama pull command-r7b |
| BAAI/bge-large-en-v1.5 | Embedding (1024-dim) | Text → vector encoding | Hugging Face (auto-downloaded) |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | Cross-encoder | Re-ranking candidates | Hugging Face (auto-downloaded) |

Further reading

  • Robertson & Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond (2009) — the theory behind BM25
  • Nogueira & Cho, Passage Re-ranking with BERT (2019) — cross-encoder re-ranking applied to information retrieval