Initial commit: LLM workshop materials
Five modules covering nanoGPT, Ollama, RAG, semantic search, and neural networks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in: commit 1604671d36

56 changed files with 5577 additions and 0 deletions
276
04-semantic-search/README.md
Normal file
@ -0,0 +1,276 @@
# Large Language Models Part IV: Advanced Retrieval and Semantic Search

**CHEG 667-013 — Chemical Engineering with Computers**

Department of Chemical and Biomolecular Engineering, University of Delaware

---

## Key idea

Build a more effective search system by combining multiple retrieval strategies and re-ranking the results.

## Key goals

- Understand why simple vector search sometimes misses relevant results
- Combine vector similarity with keyword matching (hybrid retrieval)
- Use a cross-encoder to re-rank candidates
- Compare LLM-synthesized answers with raw chunk retrieval

---

> This is an advanced topic that builds on Part III (RAG). Make sure you are comfortable building a vector store and querying it before proceeding.

In Part III, we built a RAG system that embedded documents, retrieved the most similar chunks, and passed them to an LLM. That pipeline works well for many queries — but it has blind spots.

Consider searching for a specific person's name, a date, or a technical term. Vector embeddings capture *meaning*, not exact strings. A query for "Dr. Rodriguez" might retrieve chunks about "faculty" or "professors" instead of chunks that literally contain the name. Similarly, a query about "October 2020" might return chunks about autumn events in general.

This tutorial introduces three improvements:

1. **Hybrid retrieval** — combine vector similarity (good at meaning) with BM25 keyword matching (good at exact terms)
2. **Cross-encoder re-ranking** — use a second model to score each (query, chunk) pair more carefully
3. **Raw retrieval mode** — inspect what the pipeline retrieves *before* the LLM sees it

The result is a more effective search system that catches both semantic matches and exact-term matches.

## 1. How hybrid retrieval works

In Part III, our pipeline was:

```
Query → Embed → Vector similarity (top 15) → LLM → Response
```

The improved pipeline is:

```
Query → Embed ──→ Vector similarity (top 20) ──┐
                                               ├─→ Merge & deduplicate → Cross-encoder re-rank (top 15) → LLM → Response
Query → Tokenize → BM25 term matching (top 20) ┘
```

### Vector retrieval (dense)

This is what we used in Part III. The query is embedded into a vector, and the most similar chunk vectors are returned. This catches *semantic* matches — chunks with similar meaning, even if the words are different.

### BM25 retrieval (sparse)

BM25 is a classical information-retrieval algorithm based on term frequency. It scores documents by how often the query's words appear, adjusted for document length. It's fast, requires no embeddings, and excels at finding exact names, dates, and technical terms that embeddings might miss.
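
To make the scoring concrete, here is a minimal, self-contained sketch of Okapi BM25. This is for illustration only (the tutorial itself uses the `llama-index` BM25 retriever, which handles tokenization and indexing); the toy documents and the common defaults `k1=1.5`, `b=0.75` are assumptions:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms using Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: how many documents contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = [
    "dr rodriguez spoke at the opening ceremony".split(),
    "the faculty gathered for a seminar".split(),
    "rodriguez presented new safety guidelines".split(),
]
print(bm25_scores(["rodriguez"], docs))
```

Note that only the documents literally containing "rodriguez" receive a nonzero score — exactly the behavior that complements the embedding retriever.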

### Why combine them?

Neither retriever is perfect alone:

| Query type | Vector | BM25 |
|------------|--------|------|
| "documents about campus safety" | Good — captures meaning | Decent — matches "safety" |
| "Dr. Rodriguez" | Weak — embeds as a "person" concept | Strong — matches exact name |
| "feelings of joy and accomplishment" | Strong — semantic match | Weak — might miss synonyms like "pride" |
| "October 2020 announcement" | Moderate | Strong — matches exact date |

By retrieving candidates from *both* and merging them, we get a broader candidate pool that covers both semantic and lexical matches.
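
A minimal sketch of the merge-and-deduplicate step (the chunk ids and scores below are made up for illustration). Because vector and BM25 scores live on different scales, the merge only pools candidates and records which retriever nominated each one; final ordering is left to the re-ranker:

```python
def merge_candidates(vector_hits, bm25_hits):
    """Merge two ranked candidate lists, deduplicating by chunk id.

    Each hit is a (chunk_id, score) pair. Scores from the two retrievers
    are not comparable, so we only record each chunk's provenance.
    """
    merged = {}
    for chunk_id, _ in vector_hits:
        merged[chunk_id] = {"vector"}
    for chunk_id, _ in bm25_hits:
        merged.setdefault(chunk_id, set()).add("bm25")
    return merged

# hypothetical candidates from the two retrievers
vec = [("c1", 0.91), ("c2", 0.85)]
bm = [("c2", 7.3), ("c3", 5.1)]
pool = merge_candidates(vec, bm)
print(pool)
```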

### Cross-encoder re-ranking

The merged candidates might number 30–40 chunks. We don't want to send all of them to the LLM — that wastes context and dilutes quality. A **cross-encoder** solves this by scoring each (query, chunk) pair directly.

Unlike the bi-encoder embedding model (which encodes the query and chunk separately), a cross-encoder reads the query and chunk *together* and produces a relevance score. This is more accurate but slower — which is why we use it as a second stage on a small candidate set, not on the entire corpus.

We use `cross-encoder/ms-marco-MiniLM-L-12-v2` to re-rank the merged candidates down to the top 15 before passing them to the LLM.
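
The second stage is then just a sort over joint scores. In the sketch below, `scorer` stands in for the cross-encoder's predict function (e.g. `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2").predict` from `sentence-transformers`); the toy word-overlap scorer is an assumption so the example runs without downloading the model:

```python
def rerank(query, chunks, scorer, top_n=15):
    """Keep the top_n chunks, ranked by jointly scoring each (query, chunk) pair."""
    scores = scorer([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

# Stand-in scorer for illustration: counts word overlap between query and chunk.
def toy_scorer(pairs):
    return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]

chunks = ["a note on campus safety", "employee picnic schedule"]
print(rerank("campus safety", chunks, toy_scorer, top_n=1))
```

Swapping `toy_scorer` for the real cross-encoder changes only the quality of the scores, not the shape of the code.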

## 2. Setup

### Prerequisites

Everything from Part III, plus a few additional packages:

```bash
pip install llama-index-retrievers-bm25 nltk
```

A `requirements.txt` is provided with the full set of dependencies:

```bash
pip install -r requirements.txt
```

The cross-encoder model (`cross-encoder/ms-marco-MiniLM-L-12-v2`) will download automatically on first use via `sentence-transformers`. It is small (~130 MB).

Make sure `ollama` is running and `command-r7b` is available:

```bash
ollama pull command-r7b
```

## 3. Building the vector store

The `build_store.py` script works like the one in Part III, with a few differences:

- **Smaller chunks**: 256 tokens (vs. 500 in Part III) with 25 tokens of overlap
- **Incremental updates**: by default, it only re-indexes new or modified files
- **Full rebuild**: use `--rebuild` to start from scratch

```bash
python build_store.py --rebuild
```

Or, for incremental updates after adding new files:

```bash
python build_store.py
```

```
Mode: incremental update
Loading existing index from ./store...
Index contains 42 documents
Data directory contains 44 files

  New:       2
  Modified:  0
  Deleted:   0
  Unchanged: 42

Indexing 2 file(s)...
Index updated and saved to ./store
```

### Why smaller chunks?

In Part III we used 500-token chunks. Here we use 256. Smaller chunks are more precise — each one represents a more focused piece of text. With a re-ranker to sort them, precision matters more than capturing broad context in a single chunk. The tradeoff: you get more chunks to search through, and each one has less surrounding context.
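
The chunk-count side of that tradeoff can be estimated with a quick back-of-envelope formula: a sliding window advances by (chunk size − overlap) tokens per chunk. The 50,000-token corpus below is a made-up figure, and `SentenceSplitter` also respects sentence boundaries, so real counts will differ:

```python
import math

def num_chunks(n_tokens, chunk_size, overlap):
    """Approximate chunk count for a sliding window with overlap.

    Each new chunk advances by (chunk_size - overlap) tokens, so roughly
    ceil((n_tokens - overlap) / (chunk_size - overlap)) chunks cover the text.
    """
    step = chunk_size - overlap
    return max(1, math.ceil((n_tokens - overlap) / step))

# Hypothetical 50,000-token corpus, 25-token overlap:
for size in (128, 256, 512, 1024):
    print(size, num_chunks(50_000, size, overlap=25))
```

Halving the chunk size roughly doubles the number of chunks, which is the pattern Exercise 1 asks you to observe empirically.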

> **Exercise 1:** Rebuild the store with different chunk sizes (128, 256, 512, 1024). How does the number of chunks change? How does it affect retrieval quality?

## 4. Querying with hybrid retrieval

The `query_hybrid.py` script implements the full hybrid pipeline:

```bash
python query_hybrid.py "Find documents about campus safety"
```

The output shows retrieval statistics before the LLM response:

```
Query: Find documents about campus safety
Vector: 20, BM25: 20, overlap: 8, merged: 32, re-ranked to: 15

Response:
...
```

This tells you:

- 20 candidates came from vector similarity
- 20 came from BM25
- 8 were found by both (overlap)
- 32 unique candidates remained after merging
- these were re-ranked down to 15 for the LLM

> **Exercise 2:** Run the same query using Part III's `query.py` (pure vector retrieval) and this tutorial's `query_hybrid.py`. Compare the source documents listed. Did hybrid retrieval find anything that pure vector search missed?

## 5. Raw retrieval without an LLM

Sometimes you want to see *exactly* what the retrieval pipeline found, without the LLM summarizing or rephrasing. The `retrieve.py` script runs the same hybrid retrieval and re-ranking, but outputs the raw chunk text instead of passing it to an LLM:

```bash
python retrieve.py "Dr. Rodriguez"
```

```
Query: Dr. Rodriguez
Vector: 20, BM25: 20, overlap: 3, merged: 37, re-ranked to: 15
vector-only: 17, bm25-only: 17, both: 3

================================================================================
=== [1] 2024_08_26_100859.txt (score: 0.847) [bm25-only]
================================================================================
Dr. Rodriguez spoke at the opening ceremony, emphasizing the
university's commitment to inclusive excellence...

================================================================================
=== [2] 2023_10_12_155349.txt (score: 0.712) [vector+bm25]
================================================================================
...
```

Each chunk is annotated with its source: `vector-only`, `bm25-only`, or `vector+bm25`. This lets you see which retriever nominated each result.
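
The annotation itself is simple set logic over the two candidate lists. A sketch with hypothetical chunk ids:

```python
def annotate_sources(reranked_ids, vector_ids, bm25_ids):
    """Label each re-ranked chunk with the retriever(s) that nominated it."""
    v, b = set(vector_ids), set(bm25_ids)
    labels = {}
    for cid in reranked_ids:
        if cid in v and cid in b:
            labels[cid] = "vector+bm25"
        elif cid in v:
            labels[cid] = "vector-only"
        else:
            labels[cid] = "bm25-only"
    return labels

labels = annotate_sources(["a", "b", "c"], vector_ids=["a", "b"], bm25_ids=["b", "c"])
print(labels)
```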

This is invaluable for debugging. If an LLM response seems off, check the raw retrieval first — the problem is often in *what* was retrieved, not in how the LLM synthesized it.

> **Exercise 3:** Run `retrieve.py` with a query that includes a specific name or date. How many of the top results are `bm25-only`? What would have been missed with pure vector retrieval?

## 6. Keyword search

For a complementary approach, `search_keywords.py` does pure keyword matching with no embeddings at all. It uses NLTK part-of-speech tagging to extract meaningful terms from your query, then searches the raw text files with regular expressions:

```bash
python search_keywords.py "Hurricane Sandy recovery efforts"
```

```
Query: Hurricane Sandy recovery efforts
Extracted terms: hurricane sandy, recovery, efforts

Found 12 matches across 3 files

============================================================
--- 2012_11_02_164248.txt (5 matches) ---
============================================================
>>> 12: Hurricane Sandy has caused significant damage to our campus...
...
```

This is a useful fallback when you know exactly what you're looking for and don't need semantic matching. It's also fast — no models, no vector store needed.
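
The core of this approach is a case-insensitive regex scan over the raw files. A stdlib-only sketch (it skips the NLTK term-extraction step and takes the terms directly, and the file contents are illustrative):

```python
import re

def search_keywords(terms, files):
    """Find lines in plain-text files matching any term, case-insensitively.

    `files` maps a file name to its full text; returns
    {file_name: [(line_number, line), ...]} for files with matches.
    """
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = {}
    for name, text in files.items():
        matches = [(i, line) for i, line in enumerate(text.splitlines(), 1)
                   if pattern.search(line)]
        if matches:
            hits[name] = matches
    return hits

files = {"2012_11_02.txt": "Hurricane Sandy has caused damage.\nRecovery is underway."}
print(search_keywords(["hurricane sandy", "recovery"], files))
```

No models, no index: just compile once and scan, which is why this mode is so fast.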

> **Exercise 4:** Compare the results of `search_keywords.py`, `retrieve.py`, and `query_hybrid.py` on the same query. When is each approach most useful?

## 7. Comparing the three query modes

| Script | Method | Uses LLM? | Best for |
|--------|--------|-----------|----------|
| `query_hybrid.py` | Hybrid (vector + BM25) + re-rank + LLM | Yes | Synthesized answers from documents |
| `retrieve.py` | Hybrid (vector + BM25) + re-rank | No | Inspecting raw retrieval results |
| `search_keywords.py` | POS-tagged keyword matching | No | Finding exact names, dates, terms |

## 8. Exercises

> **Exercise 5:** The hybrid retrieval uses `VECTOR_TOP_K=20` and `BM25_TOP_K=20`. Experiment with different values. What happens if you set the BM25 top-k to 0 (effectively disabling it)? What about setting the vector top-k to 0?

> **Exercise 6:** Change the re-ranker's `RERANK_TOP_N` from 15 to 5. How does this affect response quality? What about 30?

> **Exercise 7:** Modify the prompt in `query_hybrid.py`. Try asking the model to respond as a specific persona, or to format the output differently (e.g., as a timeline, or as bullet points).

> **Exercise 8:** Build this system over your own document collection — class notes, research papers, or a downloaded text corpus. Which retrieval mode works best for your documents?

## Additional resources and references

### LlamaIndex

- Documentation: https://docs.llamaindex.ai/en/stable/
- BM25 retriever: https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/

### Models

- Ollama: https://ollama.com
- Hugging Face models: https://huggingface.co/models

#### Models used in this tutorial

| Model | Type | Role | Source |
|-------|------|------|--------|
| `command-r7b` | LLM (RAG-optimized) | Response generation | `ollama pull command-r7b` |
| `BAAI/bge-large-en-v1.5` | Embedding (1024-dim) | Text → vector encoding | Hugging Face (auto-downloaded) |
| `cross-encoder/ms-marco-MiniLM-L-12-v2` | Cross-encoder | Re-ranking candidates | Hugging Face (auto-downloaded) |

### Further reading

- Robertson & Zaragoza, *The Probabilistic Relevance Framework: BM25 and Beyond* (2009) — the theory behind BM25
- Nogueira & Cho, *Passage Re-ranking with BERT* (2019) — cross-encoder re-ranking applied to information retrieval
193
04-semantic-search/build_store.py
Normal file
@ -0,0 +1,193 @@
# build_store.py
#
# Build or update the vector store from journal entries in ./data.
#
# Default mode (incremental): loads the existing index and adds only
# new or modified files. Use --rebuild for a full rebuild from scratch.
#
# January 2026
# E. M. Furst
# Used Sonnet 4.5 to suggest changes; Opus 4.6 for incremental update

import argparse
import datetime
import time
from pathlib import Path

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Shared constants
DATA_DIR = Path("./data")
PERSIST_DIR = "./store"
EMBED_MODEL_NAME = "BAAI/bge-large-en-v1.5"
CHUNK_SIZE = 256
CHUNK_OVERLAP = 25


def get_text_splitter():
    return SentenceSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        paragraph_separator="\n\n",
    )


def rebuild():
    """Full rebuild: delete and recreate the vector store from scratch."""
    if not DATA_DIR.exists():
        raise FileNotFoundError(f"Data directory not found: {DATA_DIR.absolute()}")

    print(f"Loading documents from {DATA_DIR.absolute()}...")
    documents = SimpleDirectoryReader(str(DATA_DIR)).load_data()

    if not documents:
        raise ValueError("No documents found in data directory")

    print(f"Loaded {len(documents)} document(s)")

    print("Building vector index...")
    index = VectorStoreIndex.from_documents(
        documents,
        transformations=[get_text_splitter()],
        show_progress=True,
    )

    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"Index built and saved to {PERSIST_DIR}")


def update():
    """Incremental update: add new files, re-index modified files, remove deleted files."""
    if not DATA_DIR.exists():
        raise FileNotFoundError(f"Data directory not found: {DATA_DIR.absolute()}")

    # Load existing index
    print(f"Loading existing index from {PERSIST_DIR}...")
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

    # Set transformations so index.insert() chunks correctly
    Settings.transformations = [get_text_splitter()]

    # Build lookup of indexed files: file_name -> (ref_doc_id, metadata)
    all_ref_docs = index.docstore.get_all_ref_doc_info()
    indexed = {}
    for ref_id, info in all_ref_docs.items():
        fname = info.metadata.get("file_name")
        if fname:
            indexed[fname] = (ref_id, info.metadata)

    print(f"Index contains {len(indexed)} documents")

    # Scan current files on disk
    disk_files = {f.name: f for f in sorted(DATA_DIR.glob("*.txt"))}
    print(f"Data directory contains {len(disk_files)} files")

    # Classify files
    new_files = []
    modified_files = []
    deleted_files = []
    unchanged = 0

    for fname, fpath in disk_files.items():
        if fname not in indexed:
            new_files.append(fpath)
        else:
            ref_id, meta = indexed[fname]
            # Compare file size and modification date
            stat = fpath.stat()
            disk_size = stat.st_size
            # Must use UTC to match SimpleDirectoryReader's date format
            disk_mdate = datetime.datetime.fromtimestamp(
                stat.st_mtime, tz=datetime.timezone.utc
            ).strftime("%Y-%m-%d")

            stored_size = meta.get("file_size")
            stored_mdate = meta.get("last_modified_date")

            if disk_size != stored_size or disk_mdate != stored_mdate:
                modified_files.append((fpath, ref_id))
            else:
                unchanged += 1

    for fname, (ref_id, meta) in indexed.items():
        if fname not in disk_files:
            deleted_files.append((fname, ref_id))

    # Report
    print(f"\n  New:       {len(new_files)}")
    print(f"  Modified:  {len(modified_files)}")
    print(f"  Deleted:   {len(deleted_files)}")
    print(f"  Unchanged: {unchanged}")

    if not new_files and not modified_files and not deleted_files:
        print("\nNothing to do.")
        return

    # Process deletions (including modified files that need re-indexing)
    for fname, ref_id in deleted_files:
        print(f"  Removing {fname}")
        index.delete_ref_doc(ref_id, delete_from_docstore=True)

    for fpath, ref_id in modified_files:
        print(f"  Re-indexing {fpath.name} (modified)")
        index.delete_ref_doc(ref_id, delete_from_docstore=True)

    # Process additions (new files + modified files)
    files_to_add = new_files + [fpath for fpath, _ in modified_files]
    if files_to_add:
        print(f"\nIndexing {len(files_to_add)} file(s)...")
        # Use "./" prefix to match paths from the full build (pathlib strips it)
        docs = SimpleDirectoryReader(
            input_files=[f"./{f}" for f in files_to_add]
        ).load_data()
        for doc in docs:
            index.insert(doc)

    # Persist
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"\nIndex updated and saved to {PERSIST_DIR}")


def main():
    parser = argparse.ArgumentParser(
        description="Build or update the vector store from journal entries."
    )
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Full rebuild from scratch (default: incremental update)",
    )
    args = parser.parse_args()

    # Configure embedding model
    embed_model = HuggingFaceEmbedding(model_name=EMBED_MODEL_NAME)
    Settings.embed_model = embed_model

    start = time.time()

    if args.rebuild:
        print("Mode: full rebuild")
        rebuild()
    else:
        print("Mode: incremental update")
        if not Path(PERSIST_DIR).exists():
            print(f"No existing index at {PERSIST_DIR}, doing full rebuild.")
            rebuild()
        else:
            update()

    elapsed = time.time() - start
    print(f"Done in {elapsed:.1f}s")


if __name__ == "__main__":
    main()
80
04-semantic-search/data/2012_11_02_164248.txt
Normal file
@ -0,0 +1,80 @@
Subject: [UDEL-ALL-2128] Hurricane Sandy
Date: 2012_11_02_164248

To the University of Delaware community:

We have much to be thankful for this week at the University of Delaware
as we were spared the full force of Hurricane Sandy. Even as we breathe
a sigh of relief and return to our normal activities, we are mindful of
the many, many people in this region -- some of our students among them
-- who were not so lucky. Our thoughts and prayers go out to them as
they rebuild their communities.

The potential impact of Sandy was a major concern for UD, with its
thousands of people and 430+ buildings on 2,000 acres throughout the
state. Many members of our University community worked hard over the
last several days to help us weather this "Storm of the Century."

Preparation and practice paid off as our emergency response team, led
by the Office of Campus and Public Safety, began assessing the
situation late last week and taking steps to ensure the safety of our
people and facilities. When the storm came, the campus suffered only
minor damage: wind-driven water getting into buildings through roofs,
walls and foundations; very minimal power loss, with a couple of
residential properties without power for only a few hours, thanks to
quick repair from the City of Newark; and only three trees knocked down
and destroyed, along with a lot of leaves and branches to clean up. The
Georgetown research facilities were fortunate to sustain only minor
leaks and flooding. The hardest hit area was the Lewes campus, which
had flooding on its grounds but minimal damage to buildings.

Throughout this time, the University's greatest asset continued to be
its people -- staff members from a variety of units working as a team.
A command center brought together representatives from across UD so
that issues could be responded to immediately. Staffed around the
clock, the center included Housing, Public Safety, Residence Life,
Environmental Health and Safety, Facilities and Auxiliary Services,
Emergency Management, and Communications and Marketing.

The dedication of UD's employees and students was evident everywhere:
Dining Services staff, faced with reduced numbers and limited
deliveries, kept students fed, and supported employees who worked
during the crisis; Residence Life staff and resident assistants made
sure students who remained on campus had up-to-date information and
supplies; staff in Student Health Services kept Laurel Hall open to
respond to student health needs; Human Resources staff worked over the
weekend to ensure that payroll was processed ahead of time; UD Police
officers were on patrol and responding to issues as they arose; the UD
Emergency Care Unit was at the ready; staff in Environmental Health and
Safety aided in the safe shutdown of UD laboratories and monitored fire
safety issues; Facilities staff continue to clean up debris left in
Sandy's wake and repair damage to buildings; faculty are working with
students to make up lost class time.

Our UD Alert system served as an excellent tool for keeping students,
parents and employees informed about the storm's implications for UD,
and the University's homepage was the repository for the most current
information and lists of events and activities that were canceled or
rescheduled. Through the University's accounts on Facebook and Twitter,
staff answered questions and addressed concerns, and faculty and staff
across the campus fielded phone calls and emails.

In short, a stellar job all around.

On behalf of the students, families and employees who benefited from
these efforts, I thank everyone for their dedication and service to the
people of UD.

Sincerely,

Patrick T. Harker
President


::::::::::::::::::::::::::::::::::::::::::: UD P.O. Box ::
UDEL-ALL-2128 mailing list

Online message archive
and management at https://po-box.nss.udel.edu/
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
85
04-semantic-search/data/2017_05_16_123456.txt
Normal file
@ -0,0 +1,85 @@
Subject: Employee Appreciation Week
Date: 2017_05_16_123456

To the University of Delaware Community - President Dennis Assanis

May 16, 2017

Dear colleague,

Our first year together has been one of amazing accomplishments and exciting opportunities. At the heart of our success has been you — the University of Delaware’s exceptional faculty and staff. To thank you and celebrate everything you do, we are launching our first Employee Appreciation Week.

The full week of events includes:

Monday, June 5—UDidIt Picnic
Tuesday, June 6—Self-Care Day
Wednesday, June 7—UD Spirit Day
Thursday, June 8—Flavors of UD
Friday, June 9—Employee Appreciation Night at the Blue Rocks

The week is a collaborative effort by Employee Health & Wellbeing and Human Resources. You can get all the details here.

We are dedicated to cultivating together an environment where employees are happy, healthy and continue to bring their best selves to work each day. The work you do benefits our students, our community and the world. I am truly grateful for your talents, skills, ideas and enduring commitment to the University.

Eleni and I hope you enjoy Employee Appreciation Week with your team and your family, and we look forward to seeing you at the many events.

Best,

Dennis Assanis
President

University of Delaware • Newark, DE 19716 • USA • (302) 831-2792 • www.udel.edu/president
79
04-semantic-search/data/2018_05_21_110335.txt
Normal file
@ -0,0 +1,79 @@
Subject: Robin Morgan named UD's 11th provost
Date: 2018_05_21_110335

Robin Morgan Appointed Provost - University of Delaware

May 21, 2018

Dear UD Community,

I am pleased to announce that, after a highly competitive national search, I have appointed Robin Morgan as the University of Delaware’s new provost, effective July 1. She will become the University of Delaware’s 11th provost, and the first woman to serve in this role in a permanent capacity since the position was created at UD in 1950.

Over the last seven months, Dr. Morgan already has assembled an impressive record as interim provost, most notably in her stewardship of new cluster hires among our faculty and her leadership as we move toward the creation of the graduate college.

Before working closely with her, I knew Dr. Morgan as a highly respected educator and scholar, but after watching her in action, I am equally impressed with her abilities to lead, inspire and effect change. Her energy, integrity, analytical mind, and innate knack for bringing people together, combined with her dedication and loyalty to UD, are great assets.

Dr. Morgan has a distinguished record of service to this University as a faculty member since 1985. After serving as acting dean of the College of Agriculture and Natural Resources for a year, she was named dean in 2002, serving in that role for 10 years, a period of significant growth and change for the college. From 2014-16, she served as acting chair of the Department of Biological Sciences, and she had been chair of the department from 2016 until her appointment as interim provost.

We will continue to benefit from Dr. Morgan’s deep knowledge of the University, her proven leadership across all aspects of teaching, research and administration, and her dedication to UD as she continues her career as provost.

I am looking forward to building on our close working relationship, and I am excited by all we will accomplish to take the University of Delaware forward. Please join me in congratulating her on this next chapter in her career.

Sincerely,

Dennis Assanis
President

University of Delaware • Newark, DE 19716 • USA • (302) 831-2111 • www.udel.edu/president
77
04-semantic-search/data/2020_03_29_141635.txt
Normal file
@ -0,0 +1,77 @@
Subject: Momentum and Resilience: Our UD Spring Semester Resumes
Date: 2020_03_29_141635

A Message from President Dennis Assanis

Dear UD Community,

As the University of Delaware is ready to resume the spring semester tomorrow, March 30, I want to share with all of you a special message recorded from the office in my home. Thank you all for your support at this challenging time, particularly our faculty and staff for your Herculean efforts to convert our classes from face-to-face instruction to online teaching and learning.

Best of luck with the semester ahead. As we all work remotely, please stay healthy, and stay connected!

Sincerely,

Dennis Assanis
President

University of Delaware • Newark, DE 19716 • USA • (302) 831-2111 • udel.edu/president
75
04-semantic-search/data/2023_09_19_085321.txt
Normal file
@ -0,0 +1,75 @@
Subject: National Voter Registration Day: Get Involved
Date: 2023_09_19_085321

National Voter Registration Day: Get Involved

September 19, 2023

Dear UD Community,

Do you want to make a difference in the world? Today is a good day to start.

This is National Voter Registration Day, an opportunity to make sure your voice will be heard in upcoming local, state and national elections. Voting is the most fundamental way that we engage in our democracy, effect change in society, work through our political differences and choose our leaders for the future. The voting rights we enjoy have been secured through the hard work and sacrifice of previous generations, and it is essential that everyone who is eligible to vote remains committed to preserving and exercising those rights.

At the University of Delaware, the Student Voting and Civic Engagement Committee — representing students, faculty and staff — is leading a non-partisan effort to encourage voting and help voters become better informed about the issues that matter to them. The Make It Count voter registration drive is scheduled for 2-6 p.m. today on The Green, with games, music and the opportunity to register through the TurboVote app, which also allows users to request an absentee ballot and sign up for election reminders. The committee is planning additional events this academic year to promote voting, education and civil discourse as the nation heads into the 2024 election season.

Being a Blue Hen means sharing a commitment to creating a better world. And being a registered, engaged and informed voter is one of the best ways for all of us to achieve that vision.

Sincerely,

Dennis Assanis
President

University of Delaware • Newark, DE • udel.edu/president
77
04-semantic-search/data/2023_10_12_155349.txt
Normal file
@ -0,0 +1,77 @@
Subject: Affirming our position and purpose
Date: 2023_10_12_155349

Affirming our position and purpose | A message from UD President Dennis Assanis

October 12, 2023

Dear UD Community,

Since my message yesterday, I have talked to many members of our community who — like me — are devastated and appalled by the terrorist attacks on Israel and the ongoing loss of life that has taken place in the Middle East.

I want to be sure that our position is very clear: We at the University of Delaware unequivocally condemn the horrific attacks by Hamas terrorists upon Israel that have shaken the world. The atrocities of crime, abduction, hostage-taking and mass murder targeted against Jewish civilians will forever remain a stain on human history. Our community’s foundation of civility and respect has been challenged to an unimaginable extent in light of the antisemitic brutalities that have been committed against innocent victims.

As your president, I wish words could calm the heartache and ease the fear and grief. Unfortunately, we all know that events as complicated and devastating as those taking place in the Middle East right now will continue to evolve. The longstanding humanitarian crisis needs to be acknowledged, and we should not equate the terrorist group Hamas with innocent Palestinian, Muslim and Arab people. The ensuing war-inflicted pain, suffering and death that continues to play out across the region, including Gaza, is heartbreaking for all.

We must remember that, first and foremost, UD is a place of learning. As we engage in difficult conversations about the longstanding conflicts in the Middle East, we should always strive to do so safely, with mutual respect and without bias or judgement. I encourage our students, faculty and staff to continue organizing events to educate and unite our community. Please seize these opportunities not only as individuals, but as members of a true community defined by the freedoms that we treasure so very deeply.

So, my message to you all is to have hope, to support each other, and to realize that the perspectives and feelings we are all experiencing right now — many of which uniquely connect to our personal backgrounds — matter. Please remember this as you walk across campus, sit in your next classroom, share experiences with other members of our community, or simply take time to reflect.

Respectfully,

Dennis Assanis
President

University of Delaware • Newark, DE • udel.edu/president
82
04-semantic-search/data/2024_08_26_100859.txt
Normal file
@ -0,0 +1,82 @@
Subject: A warm welcome to our UD community!
Date: 2024_08_26_100859

A warm welcome to our UD community!

August 26, 2024

Dear UD Community,

I love the beginning of every new academic year and the renewed energy and sense of anticipation that it brings to every member of our campus community. The large influx of new people and ideas that come along with each new start is truly invigorating. Whether you are a new or continuing student, faculty or staff member, on behalf of everyone in our community, I want to extend a very warm welcome to you and thank you for everything you contribute, individually and collectively, to make the University of Delaware such a unique place.

Students, your fresh perspectives, your passion for learning, and your dreams and aspirations for the boundless possibilities that lie ahead are inspiring. Faculty, your intellectual energy, your insights and expertise, and above all, your genuine interest in transferring and sharing your knowledge with all of us are the beating heart of our institution. And to all our staff, your hard work and dedicated talents provide the essential support and services to help ensure our students are successful in all their personal, academic and career pursuits.

Here at UD, our shared purpose is to cultivate learning, develop knowledge and foster the free exchange of ideas. The connections we make and the relationships we build help advance the mission of the University. Our focus on academic excellence in all fields of study and our opportunities for groundbreaking research rely on our endless curiosity, mutual respect and open mindedness. Together, we are stronger.

This sense of connection and belonging at UD is fundamental to our campus culture. Your willingness to hear and consider all voices and viewpoints is critical to shaping the vibrant and inclusive culture of our entire institution. Only when we commit to constructive growth, based on a foundation of civility and respect for ourselves and each other, can we realize true progress. Empowered by diverse perspectives, it is the opportunities to advance ideas that enrich learning and create positive impact in the world that unite all of us.

To celebrate the new semester and welcome our undergraduate Class of 2028, all members of our community are invited to attend the Twilight Induction ceremony tonight at 7:30 p.m. on the north side of Memorial Hall or online on Facebook Live.

As your President, I am so excited by all that we can accomplish together throughout this academic year. My wife, Eleni, and I wish you all the best at the start of this new semester and beyond. We look forward to meeting you on campus!

Sincerely,

Dennis Assanis
President

University of Delaware • Newark, DE • udel.edu
80
04-semantic-search/data/2025_02_13_160414.txt
Normal file
@ -0,0 +1,80 @@
Subject: UPDATE: Recent Executive Orders
Date: 2025_02_13_160414

UPDATE: Recent Executive Orders | University of Delaware

Feb. 13, 2025

Dear UD Community,

I know many of you continue to experience disruption and anxiety stemming from the recent federal actions and executive orders regarding a multitude of issues — from research funding to education, human rights, and immigration among other areas. As I communicated to the University of Delaware community in my Jan. 28 campus message and my Feb. 3 comments to the Faculty Senate, we will do everything we can to minimize disruption to UD students, faculty and staff while remaining in compliance with federal law.

To support our community, we have created this resource page that will be updated regularly with information for UD students, faculty and staff regarding ongoing federal actions, directives and developments, including guidance in response to changing conditions. Also, this page from the Research Office contains specific guidance related to research projects and grants. In parallel, we will continue to advocate on behalf of the University’s interests regarding any impact that federal or state actions could have on our students, faculty and staff.

One example is our response this week related to the federal action to impose a 15% limit on reimbursements for indirect administrative costs (Facilities and Administrative, or F&A costs) for all National Institutes of Health (NIH) research grants. This immediate cut in funding would have a devastating impact on all biomedical, health and life science advances and human wellness, including here at UD. In response, the Delaware Attorney General filed a lawsuit jointly with 21 other state attorneys general. The University supported the Attorney General’s lawsuit by submitting a declaration detailing the impact of the NIH rate cap on the institution. Fortunately, the attorneys general were successful, and a temporary restraining order was granted on Monday. Further, the Association of Public and Land-grant Universities, the Association of American Universities, and the American Council on Education announced a similar lawsuit.

As we navigate this rapidly evolving landscape together, our values will continue to be at the heart of our community. We will continue to foster an atmosphere that promotes the free exchange of ideas and opinions; we will continue to welcome and value people of different backgrounds, perspectives and learning experiences; and we will continue to encourage respect and civility toward everyone.

Please know that my leadership team and I are here to help and support our community during this time. Feel free to submit any questions pertaining to these matters here, and we will do our best to add relevant information on the resource pages. I deeply appreciate your resilience and patience as we continue to work together to advance the important mission of our University.

Sincerely,

Dennis Assanis
President

University of Delaware • Newark, DE • udel.edu
87
04-semantic-search/data/2025_04_29_230614.txt
Normal file
@ -0,0 +1,87 @@
Subject: Extending condolences and offering support
Date: 2025_04_29_230614

Extending condolences and offering support

April 29, 2025

Dear UD Community,

It is with a heavy heart that we share this information with you. Earlier today, a University of Delaware student died in a traffic accident on Main Street near campus, and several other people, including other UD students, suffered injuries. There is no ongoing threat to the University community.

University of Delaware Police are continuing to work with the Newark Police Department, which is actively investigating the incident. As a result, information is limited and the Newark Police Department is not releasing the victims’ names at this time, pending family notification.

This is a terrible tragedy for everyone in our UD community. We speak for the entire University in offering our condolences to the families, friends and classmates of the victims, and keep the other members of our community in our thoughts who may have witnessed the crash and its aftermath. The safety of our entire community remains our top priority, and we will continue to work with our partners in city and state government to address safety concerns around and on the UD campus.

As we all begin to cope with this traumatic incident, we encourage you to support one another and reach out for additional help from the UD resources listed below as needed.

Sincerely,

Dennis Assanis
President

José-Luis Riera
Vice President for Student Life

Support and resources

Center for Counseling and Student Development

Counselors and Student Life staff are available in Warner Hall 101 on Wednesday, April 30, from 9 a.m. to 3 p.m. for counseling services.

TimelyCare — A virtual health and wellbeing platform available 24/7 for UD students

Student Advocacy and Support — Available to assist students who need support navigating University resources or complex issues. Call 302-831-8939 or email studentsupport@udel.edu to schedule an appointment.

ComPsych® GuidanceResources® — Mental health support for UD benefited employees. Access services through the link or call 877-527-4742 for support.

Additional safety and wellness resources — Information about UD Police, Student Health Services and other services.

Information about the UD Alert, the LiveSafe app and safety notification communication.

University of Delaware • Newark, DE • udel.edu
76
04-semantic-search/data/2025_04_30_160615.txt
Normal file
@ -0,0 +1,76 @@
Subject: Sharing our grief, enhancing safety
Date: 2025_04_30_160615

Sharing our grief, enhancing safety

April 30, 2025

Dear UD Community,

Since last evening’s crash on Main Street that took the life of a University of Delaware graduate student (whose identity is being withheld at this time) and injured several others, we have been struggling to cope with the pain of this senseless tragedy. Throughout the UD community, we are all feeling the deep ache of loss, and we will continue to work through our grief together.

Today, Newark Police announced an arrest in connection with the crash, reiterating that there is no ongoing threat to the community.

Main Street is where we eat, shop and share our lives with our friends, families and classmates. Because it is part of the state’s roadway systems, we have been working with local and state officials this year, including our partners at Delaware Department of Transportation, to address traffic safety on and around Main Street. In the wake of this tragedy, we will reinforce and accelerate those efforts. We recognize there isn’t a simple solution, particularly when these tragedies involve actions taken by individuals that may not be stopped by changes to roadways or infrastructure. However, this incident underscores that our collective efforts must take on renewed urgency.

University leaders joined Delaware Attorney General Kathy Jennings and Newark Mayor Travis McDermott today for a press conference, at which we expressed our shared commitment to enhanced safety along Main Street. The University has pledged to continue these discussions through meetings with the offices of AG Jennings and Mayor McDermott, in addition to DelDOT, in the near future. The University remains committed to advancing meaningful solutions, while the University’s Division of Student Life and Graduate College are connecting with students about effective advocacy, civic engagement and partnerships in order to support these efforts.

We are also aware that members of the UD community may have witnessed the crash and its aftermath or have close relationships with the victims. We encourage everyone to become familiar with and use, as needed, the available University counseling and support resources that were shared in Tuesday evening’s message to the UD community. Counseling services are available at Warner Hall and through TimelyCare anytime, 24/7. Students with physical injuries or medical concerns relating to the incident can contact Student Health Services at 302-831-2226, Option 0, or visit Laurel Hall to meet with triage nurses available until 5 p.m. After hours, students can contact the Highmark Nurse line at 888-258-3428 or visit local urgent care centers (Newark Urgent Care at 324 E. Main Street, or ChristianaCare GoHealth at 550 S. College Avenue, Suite 115).

During this difficult time in our community, we all need to continue supporting and standing by one another as we move forward together.

Sincerely,

Dennis Assanis
President

Laura Carlson
Provost

José-Luis Riera
Vice President for Student Life

University of Delaware • Newark, DE • udel.edu
176
04-semantic-search/query_hybrid.py
Normal file
@ -0,0 +1,176 @@
# query_hybrid.py
# Hybrid retrieval: BM25 (sparse) + vector similarity (dense) + cross-encoder
#
# Combines two retrieval strategies to catch both exact term matches and
# semantic similarity:
#   1. Retrieve top-20 via vector similarity (bi-encoder, catches meaning)
#   2. Retrieve top-20 via BM25 (term frequency, catches exact names/dates)
#   3. Merge and deduplicate candidates by node ID
#   4. Re-rank the union with a cross-encoder -> top-15
#   5. Pass re-ranked chunks to LLM for synthesis
#
# The cross-encoder doesn't care where candidates came from -- it scores
# each (query, chunk) pair on its own merits. BM25's job is just to
# nominate candidates that vector similarity might miss.
#
# E.M.F. February 2026

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

import sys

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
    get_response_synthesizer,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.prompts import PromptTemplate
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.retrievers.bm25 import BM25Retriever

#
# Globals
#

# Embedding model (must match build_store.py)
EMBED_MODEL = HuggingFaceEmbedding(
    cache_folder="./models",
    model_name="BAAI/bge-large-en-v1.5",
    local_files_only=True,
)

# LLM model for generation
LLM_MODEL = "command-r7b"

# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15

# Retrieval parameters
VECTOR_TOP_K = 20  # candidates from vector similarity
BM25_TOP_K = 20    # candidates from BM25 term matching

#
# Custom prompt -- same as v3
#
PROMPT = PromptTemplate(
    """You are a precise research assistant analyzing excerpts from a personal journal collection.
Every excerpt below has been selected and ranked for relevance to the query.

CONTEXT (ranked by relevance):
{context_str}

QUERY:
{query_str}

Instructions:
- Answer ONLY using information explicitly present in the CONTEXT above
- Examine ALL provided excerpts, not just the top few -- each one was selected for relevance
- Be specific: quote or closely paraphrase key passages and cite their file names
- When multiple files touch on the query, note what each one contributes
- If the context doesn't contain enough information to answer fully, say so

Your response should:
1. Directly answer the query, drawing on as many relevant excerpts as possible
2. Reference specific files and their content (e.g., "In <filename>, ...")
3. End with a list of all files that contributed to your answer, with a brief note on each

If the context is insufficient, explain what's missing."""
)


def main():
    # Configure the LLM and embedding model for a local model served by
    # Ollama. Note: Ollama's temperature defaults to 0.8; we lower it for
    # more deterministic synthesis.
    Settings.llm = Ollama(
        model=LLM_MODEL,
        temperature=0.3,
        request_timeout=360.0,
        context_window=8000,
    )

    # To use the OpenAI API instead:
    # from llama_index.llms.openai import OpenAI
    # Settings.llm = OpenAI(
    #     model="gpt-4o-mini",  # or "gpt-4o" for higher quality
    #     temperature=0.3,
    # )

    Settings.embed_model = EMBED_MODEL

    # Load persisted vector store
    storage_context = StorageContext.from_defaults(persist_dir="./store")
    index = load_index_from_storage(storage_context)

    # --- Retrievers ---

    # Vector retriever (dense: cosine similarity over embeddings)
    vector_retriever = index.as_retriever(similarity_top_k=VECTOR_TOP_K)

    # BM25 retriever (sparse: term frequency scoring)
    bm25_retriever = BM25Retriever.from_defaults(
        index=index,
        similarity_top_k=BM25_TOP_K,
    )

    # Cross-encoder re-ranker
    reranker = SentenceTransformerRerank(
        model=RERANK_MODEL,
        top_n=RERANK_TOP_N,
    )

    # --- Query ---

    if len(sys.argv) < 2:
        print("Usage: python query_hybrid.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Retrieve from both sources
    vector_nodes = vector_retriever.retrieve(q)
    bm25_nodes = bm25_retriever.retrieve(q)

    # Track which retriever found each node
    vector_ids = {n.node.node_id for n in vector_nodes}
    bm25_ids = {n.node.node_id for n in bm25_nodes}

    # Merge and deduplicate by node ID
    seen_ids = set()
    merged = []
    for node in vector_nodes + bm25_nodes:
        node_id = node.node.node_id
        if node_id not in seen_ids:
            seen_ids.add(node_id)
            merged.append(node)

    # Re-rank the merged candidates with cross-encoder
    reranked = reranker.postprocess_nodes(merged, query_str=q)

    # Report retrieval stats
    n_both = len(vector_ids & bm25_ids)
    n_vector_only = len(vector_ids - bm25_ids)
    n_bm25_only = len(bm25_ids - vector_ids)

    print(f"\nQuery: {q}")
    print(f"Vector: {len(vector_nodes)}, BM25: {len(bm25_nodes)}, "
          f"overlap: {n_both}, merged: {len(merged)}, re-ranked to: {len(reranked)}")
    print(f"  vector-only: {n_vector_only}, bm25-only: {n_bm25_only}, both: {n_both}")

    # Synthesize response with LLM
    synthesizer = get_response_synthesizer(text_qa_template=PROMPT)
    response = synthesizer.synthesize(q, nodes=reranked)

    # Output
    print("\nResponse:\n")
    print(response.response)

    print("\nSource documents:")
    for node in response.source_nodes:
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        score_str = f"{score:.3f}" if score is not None else "n/a"
        print(f"{meta.get('file_name')} {meta.get('file_path')} {score_str}")


if __name__ == "__main__":
    main()
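The merge-and-deduplicate step above can be sketched independently of LlamaIndex. In this minimal sketch, plain dicts with an `"id"` key stand in for node objects (a hypothetical simplification of `node.node.node_id`); vector candidates win ties because they are concatenated first, so their ordering is preserved:

```python
# Minimal sketch of the merge/dedup step, with dicts standing in for nodes.
def merge_candidates(vector_nodes, bm25_nodes):
    seen = set()
    merged = []
    for node in vector_nodes + bm25_nodes:
        if node["id"] not in seen:
            seen.add(node["id"])
            merged.append(node)
    return merged

vec = [{"id": "a"}, {"id": "b"}]
bm25 = [{"id": "b"}, {"id": "c"}]
print([n["id"] for n in merge_candidates(vec, bm25)])  # -> ['a', 'b', 'c']
```

Because the union typically exceeds either top-k list, the cross-encoder afterwards does the real ranking work; the merge only decides which candidates it gets to see.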
7
04-semantic-search/requirements.txt
Normal file
@ -0,0 +1,7 @@
llama-index-core
llama-index-readers-file
llama-index-llms-ollama
llama-index-embeddings-huggingface
llama-index-retrievers-bm25
nltk
sentence-transformers
140
04-semantic-search/retrieve.py
Normal file
@ -0,0 +1,140 @@
# retrieve.py
# Hybrid verbatim chunk retrieval: BM25 + vector search + cross-encoder, no LLM.
#
# Same hybrid retrieval as query_hybrid.py but outputs raw chunk text
# instead of LLM synthesis. Useful for inspecting what the hybrid pipeline
# retrieves.
#
# Each chunk is annotated with its source (vector, BM25, or both) so you can
# see which retriever nominated it.
#
# E.M.F. February 2026

# Environment vars must be set before importing huggingface/transformers
# libraries, because huggingface_hub.constants evaluates HF_HUB_OFFLINE
# at import time.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "./models"
os.environ["HF_HUB_OFFLINE"] = "1"

import sys
import textwrap

from llama_index.core import (
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.retrievers.bm25 import BM25Retriever

#
# Globals
#

# Embedding model (must match build_store.py)
EMBED_MODEL = HuggingFaceEmbedding(
    cache_folder="./models",
    model_name="BAAI/bge-large-en-v1.5",
    local_files_only=True,
)

# Cross-encoder model for re-ranking (cached in ./models/)
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANK_TOP_N = 15

# Retrieval parameters
VECTOR_TOP_K = 20
BM25_TOP_K = 20

# Output formatting
WRAP_WIDTH = 80


def main():
    # No LLM needed -- set embed model only
    Settings.embed_model = EMBED_MODEL

    # Load persisted vector store
    storage_context = StorageContext.from_defaults(persist_dir="./store")
    index = load_index_from_storage(storage_context)

    # --- Retrievers ---

    vector_retriever = index.as_retriever(similarity_top_k=VECTOR_TOP_K)

    bm25_retriever = BM25Retriever.from_defaults(
        index=index,
        similarity_top_k=BM25_TOP_K,
    )

    # Cross-encoder re-ranker
    reranker = SentenceTransformerRerank(
        model=RERANK_MODEL,
        top_n=RERANK_TOP_N,
    )

    # Query
    if len(sys.argv) < 2:
        print("Usage: python retrieve.py QUERY_TEXT")
        sys.exit(1)
    q = " ".join(sys.argv[1:])

    # Retrieve from both sources
    vector_nodes = vector_retriever.retrieve(q)
    bm25_nodes = bm25_retriever.retrieve(q)

    # Track which retriever found each node
    vector_ids = {n.node.node_id for n in vector_nodes}
    bm25_ids = {n.node.node_id for n in bm25_nodes}

    # Merge and deduplicate by node ID
    seen_ids = set()
    merged = []
    for node in vector_nodes + bm25_nodes:
        node_id = node.node.node_id
        if node_id not in seen_ids:
            seen_ids.add(node_id)
            merged.append(node)

    # Re-rank merged candidates
    reranked = reranker.postprocess_nodes(merged, query_str=q)

    # Retrieval stats
    n_both = len(vector_ids & bm25_ids)
    n_vector_only = len(vector_ids - bm25_ids)
    n_bm25_only = len(bm25_ids - vector_ids)

    print(f"\nQuery: {q}")
    print(f"Vector: {len(vector_nodes)}, BM25: {len(bm25_nodes)}, "
          f"overlap: {n_both}, merged: {len(merged)}, re-ranked to: {len(reranked)}")
    print(f"  vector-only: {n_vector_only}, bm25-only: {n_bm25_only}, both: {n_both}\n")

    # Output re-ranked chunks with source annotation
    for i, node in enumerate(reranked, 1):
        meta = getattr(node, "metadata", None) or node.node.metadata
        score = getattr(node, "score", None)
        file_name = meta.get("file_name", "unknown")
        text = node.get_content()
        node_id = node.node.node_id

        # Annotate source
        in_vector = node_id in vector_ids
        in_bm25 = node_id in bm25_ids
        if in_vector and in_bm25:
            source = "vector+bm25"
        elif in_bm25:
            source = "bm25-only"
        else:
            source = "vector-only"

        print("=" * WRAP_WIDTH)
        print(f"=== [{i}] {file_name} (score: {score:.3f}) [{source}]")
        print("=" * WRAP_WIDTH)
        for line in text.splitlines():
            if line.strip():
                print(textwrap.fill(line, width=WRAP_WIDTH))
            else:
                print()
        print()


if __name__ == "__main__":
    main()
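The BM25 side of this pipeline is what catches literal names and dates that embeddings blur together. The toy scorer below is a minimal Okapi-style BM25 sketch in pure Python (not the implementation BM25Retriever actually uses; the documents and parameters are illustrative) showing that only documents containing the literal query term receive a nonzero score:

```python
# Toy Okapi-style BM25 over whitespace-tokenized one-line "documents".
# k1 and b are the conventional defaults; idf uses the +1 smoothing so it
# never goes negative.
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        s = 0.0
        for term in query.lower().split():
            tf = doc.count(term)                      # term frequency in this doc
            df = sum(1 for d in tokenized if term in d)  # docs containing term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = ["message from dennis assanis",
        "a note about faculty",
        "campus safety update"]
print(bm25_scores("assanis", docs))  # only the first doc scores > 0
```

This is exactly the failure mode of pure vector search in reverse: a query for "Dr. Rodriguez" or a specific date gets a strong BM25 signal from the one chunk that contains the literal string, and the cross-encoder then decides whether that chunk is actually relevant.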
41
04-semantic-search/run_query.sh
Executable file
@ -0,0 +1,41 @@
#!/bin/bash
# This shell script handles I/O for the python query engine.
# It takes a query and returns the formatted results.

# E.M.F. August 2025

# Usage: ./run_query.sh

QUERY_SCRIPT="query_hybrid.py"
VENV_DIR=".venv"

# Activate the virtual environment
if [ -d "$VENV_DIR" ]; then
    source "$VENV_DIR/bin/activate"
    echo "Activated virtual environment: $VENV_DIR"
else
    echo "Error: Virtual environment not found at '$VENV_DIR'" >&2
    echo "Create one with: python3 -m venv $VENV_DIR" >&2
    exit 1
fi

echo -e "Current query engine is $QUERY_SCRIPT\n"

# Loop until input is "exit"
while true; do
    read -p "Enter your query (or type 'exit' to quit): " query
    if [ "$query" == "exit" ] || [ "$query" == "quit" ] || [ "$query" == "" ]; then
        echo "Exiting..."
        break
    fi
    time_start=$(date +%s)

    # Call the python script with the query and format the output.
    # The script reads the query from positional args, so no --query flag.
    python3 "$QUERY_SCRIPT" "$query" | \
        expand | sed -E 's|(.* )(.*/data)|\1./data|' | fold -s -w 131

    time_end=$(date +%s)
    elapsed=$((time_end - time_start))
    echo -e "Query processed in $elapsed seconds.\n"
    echo "$query" >> query.log
done
40
04-semantic-search/run_retrieve.sh
Executable file
@@ -0,0 +1,40 @@
#!/bin/bash
# This shell script handles I/O for the python retrieval engine:
# it reads a query from the user and prints the formatted results.

# E.M.F. August 2025

# Usage: ./run_retrieve.sh

QUERY_SCRIPT="retrieve.py"
VENV_DIR=".venv"

# Activate the virtual environment
if [ -d "$VENV_DIR" ]; then
    source "$VENV_DIR/bin/activate"
    echo "Activated virtual environment: $VENV_DIR"
else
    echo "Error: Virtual environment not found at '$VENV_DIR'" >&2
    echo "Create one with: python3 -m venv $VENV_DIR" >&2
    exit 1
fi

echo -e "$QUERY_SCRIPT -- retrieve vector store chunks based on similarity + BM25 with reranking.\n"

# Loop until input is "exit", "quit", or empty
while true; do
    read -r -p "Enter your query (or type 'exit' to quit): " query
    if [ "$query" == "exit" ] || [ "$query" == "quit" ] || [ "$query" == "" ]; then
        echo "Exiting..."
        break
    fi
    time_start=$(date +%s)

    # Call the python script with the query; expand tabs, shorten absolute
    # data paths to ./data, and wrap long lines for the terminal
    python3 "$QUERY_SCRIPT" --query "$query" | \
        expand | sed -E 's|(.* )(.*/data)|\1./data|' | fold -s -w 131

    time_end=$(date +%s)
    elapsed=$((time_end - time_start))
    echo -e "Query processed in $elapsed seconds.\n"
done
189
04-semantic-search/search_keywords.py
Normal file
@@ -0,0 +1,189 @@
# search_keywords.py
# Keyword search: extract terms from a query using POS tagging, then grep
# across journal files for matches.
#
# Complements the vector search pipeline by catching exact names, places,
# and dates that embeddings can miss. No vector store or LLM needed.
#
# Term extraction uses NLTK POS tagging to keep nouns (NN*), proper nouns
# (NNP*), and adjectives (JJ*) -- skipping stopwords and function words
# automatically. Consecutive proper nouns are joined into multi-word phrases
# (e.g., "Robert Wright" stays as one search term, not "robert" + "wright").
#
# E.M.F. February 2026

import sys
import re
from pathlib import Path

import nltk

#
# Globals
#
DATA_DIR = Path("./data")
CONTEXT_LINES = 2         # lines of context around each match
MAX_MATCHES_PER_FILE = 3  # cap matches shown per file to avoid flooding

# POS tags to keep: nouns, proper nouns, adjectives
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJS", "JJR"}

# Proper noun tags (consecutive runs are joined as phrases)
PROPER_NOUN_TAGS = {"NNP", "NNPS"}

# Minimum word length to keep (filters out short noise)
MIN_WORD_LEN = 3


def ensure_nltk_data():
    """Download NLTK data if not already present."""
    for resource, name in [
        ("tokenizers/punkt_tab", "punkt_tab"),
        ("taggers/averaged_perceptron_tagger_eng", "averaged_perceptron_tagger_eng"),
    ]:
        try:
            nltk.data.find(resource)
        except LookupError:
            print(f"Downloading NLTK resource: {name}")
            nltk.download(name, quiet=True)


def extract_terms(query):
    """Extract key terms from a query using POS tagging.

    Tokenizes the query, runs POS tagging, and keeps nouns, proper nouns,
    and adjectives. Consecutive proper nouns (NNP/NNPS) are joined into
    multi-word phrases (e.g., "Robert Wright" → "robert wright").

    Returns a list of terms (lowercase), phrases listed first.
    """
    tokens = nltk.word_tokenize(query)
    tagged = nltk.pos_tag(tokens)

    phrases = []       # multi-word proper noun phrases
    single_terms = []  # individual nouns/adjectives
    proper_run = []    # accumulator for consecutive proper nouns

    for word, tag in tagged:
        if tag in PROPER_NOUN_TAGS:
            proper_run.append(word)
        else:
            # Flush any accumulated proper noun run
            if proper_run:
                phrase = " ".join(proper_run).lower()
                if len(phrase) >= MIN_WORD_LEN:
                    phrases.append(phrase)
                proper_run = []
            # Keep other nouns and adjectives as single terms
            if tag in KEEP_TAGS and len(word) >= MIN_WORD_LEN:
                single_terms.append(word.lower())

    # Flush final proper noun run
    if proper_run:
        phrase = " ".join(proper_run).lower()
        if len(phrase) >= MIN_WORD_LEN:
            phrases.append(phrase)

    # Phrases first (more specific), then single terms
    all_terms = phrases + single_terms
    return list(dict.fromkeys(all_terms))  # deduplicate, preserve order


def search_files(terms, data_dir, context_lines=CONTEXT_LINES):
    """Search all .txt files in data_dir for the given terms.

    Returns a list of (file_path, match_count, matches) where matches is a
    list of (line_number, context_block) tuples.
    """
    if not terms:
        return []

    # Build a single regex pattern that matches any term (case-insensitive)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE
    )

    results = []
    txt_files = sorted(data_dir.glob("*.txt"))

    for fpath in txt_files:
        try:
            lines = fpath.read_text(encoding="utf-8").splitlines()
        except (OSError, UnicodeDecodeError):
            continue

        matches = []
        match_count = 0
        seen_lines = set()  # avoid overlapping context blocks

        for i, line in enumerate(lines):
            if pattern.search(line):
                match_count += 1
                if i in seen_lines:
                    continue

                # Extract context window
                start = max(0, i - context_lines)
                end = min(len(lines), i + context_lines + 1)
                block = []
                for j in range(start, end):
                    seen_lines.add(j)
                    marker = ">>>" if j == i else "   "
                    block.append(f"  {marker} {j+1:4d}: {lines[j]}")

                matches.append((i + 1, "\n".join(block)))

        if match_count > 0:
            results.append((fpath, match_count, matches))

    # Sort by match count (most matches first)
    results.sort(key=lambda x: x[1], reverse=True)
    return results


def main():
    if len(sys.argv) < 2:
        print("Usage: python search_keywords.py QUERY_TEXT")
        sys.exit(1)

    ensure_nltk_data()

    q = " ".join(sys.argv[1:])

    # Extract terms
    terms = extract_terms(q)
    if not terms:
        print(f"Query: {q}")
        print("No searchable terms extracted. Try a more specific query.")
        sys.exit(0)

    print(f"Query: {q}")
    print(f"Extracted terms: {', '.join(terms)}\n")

    # Search
    results = search_files(terms, DATA_DIR)

    if not results:
        print("No matches found.")
        sys.exit(0)

    # Summary
    total_matches = sum(r[1] for r in results)
    print(f"Found {total_matches} matches across {len(results)} files\n")

    # Detailed output
    for fpath, match_count, matches in results:
        print("=" * 60)
        print(f"--- {fpath.name} ({match_count} matches) ---")
        print("=" * 60)
        for line_num, block in matches[:MAX_MATCHES_PER_FILE]:
            print(block)
            print()
        if len(matches) > MAX_MATCHES_PER_FILE:
            print(f"  ... and {len(matches) - MAX_MATCHES_PER_FILE} more matches\n")


if __name__ == "__main__":
    main()
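The proper-noun run-joining behavior of `extract_terms` can be illustrated without NLTK by feeding pre-tagged tokens through the same loop. This is a standalone sketch; `join_proper_runs` is a hypothetical helper that mirrors the logic above:

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJS", "JJR"}
MIN_WORD_LEN = 3

def join_proper_runs(tagged):
    """Join consecutive proper nouns into phrases; keep other nouns/adjectives."""
    phrases, singles, run = [], [], []
    for word, tag in tagged:
        if tag in PROPER_NOUN_TAGS:
            run.append(word)  # accumulate consecutive proper nouns
        else:
            if run:  # flush the accumulated run as one phrase
                phrase = " ".join(run).lower()
                if len(phrase) >= MIN_WORD_LEN:
                    phrases.append(phrase)
                run = []
            if tag in KEEP_TAGS and len(word) >= MIN_WORD_LEN:
                singles.append(word.lower())
    if run:  # flush a trailing run
        phrase = " ".join(run).lower()
        if len(phrase) >= MIN_WORD_LEN:
            phrases.append(phrase)
    return list(dict.fromkeys(phrases + singles))

# Pre-tagged tokens standing in for nltk.pos_tag() output
tagged = [("When", "WRB"), ("did", "VBD"), ("Robert", "NNP"), ("Wright", "NNP"),
          ("visit", "VB"), ("the", "DT"), ("old", "JJ"), ("laboratory", "NN")]
print(join_proper_runs(tagged))  # → ['robert wright', 'old', 'laboratory']
```

"Robert Wright" survives as a single two-word search term, while the verb, determiner, and adverb drop out, which is exactly why the grep pattern can then match the full name literally.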