Initial project planning docs for UD administrative analytics
- Project scope document (v0.1): objectives, data sources, key metrics, phases
- Phase 1 implementation plan: IPEDS, IRS 990, BLS CPI-U acquisition for UD
- CLAUDE.md: project context and conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit f037c50736
3 changed files with 390 additions and 0 deletions
67
CLAUDE.md
Normal file
@@ -0,0 +1,67 @@
# Admin Analytics

University administrative cost benchmarking project using public data (IRS 990, IPEDS, BLS CPI-U). **First iteration is scoped to the University of Delaware only.** Peer/AAU/multi-institution comparisons are planned for a later iteration.

## Project status

Currently in planning. Phase 1 (Data Acquisition) is planned but not yet built. See `phase1_plan.md` for the full implementation plan and `administrative_analytics_scope_v0.1.md` for project scope.

## Architecture

- **Language:** Python
- **Database:** DuckDB for Phase 1 (single-file, zero-config). Migrate to PostgreSQL in Phase 3 when the dashboard needs concurrent access.
- **Package manager:** uv
- **CLI framework:** typer or click (TBD)
- **Testing:** pytest

## Data sources

| Source | Format | What we extract |
|--------|--------|-----------------|
| IRS 990 bulk XML | XML (versioned schemas) | Filing financials, Part VII compensation, Schedule J detailed compensation |
| IPEDS | CSV bulk downloads | Institution directory (HD), finance by function (F1A/F2), staffing (S/SAL), enrollment (EF) |
| BLS CPI-U | Flat file or API | Consumer Price Index for inflation-adjusted compensation analysis |
| Admin office web pages (stretch) | HTML scraping | Staff directory headcounts |
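
The CPI-U series is listed above as the basis for inflation-adjusted compensation analysis. As a rough sketch of what that adjustment could look like, the function below rescales a nominal amount by the ratio of two index values. The function name, the dictionary layout, and the index values are all assumptions made for illustration; the index values are placeholders, not BLS figures.

```python
def to_real_dollars(nominal: float, cpi_in_year: float, cpi_in_base_year: float) -> float:
    """Express a nominal amount in base-year dollars using a CPI ratio."""
    if cpi_in_year <= 0 or cpi_in_base_year <= 0:
        raise ValueError("CPI index values must be positive")
    return nominal * (cpi_in_base_year / cpi_in_year)


# Placeholder index values for illustration only (NOT real CPI-U data).
cpi_by_year = {2010: 100.0, 2020: 125.0}

# A 2010 amount of 400,000 nominal dollars, expressed in 2020 dollars.
adjusted = to_real_dollars(400_000, cpi_by_year[2010], cpi_by_year[2020])
print(f"{adjusted:,.0f}")
```

Whether the project would use annual averages or a specific month's index is not decided in this document; the sketch works with either as long as both values come from the same series.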

## Key concepts

- **University of Delaware** is the sole target institution for the first iteration. UD's IPEDS UNITID is the anchor for all IPEDS queries.
- **UD is a public university** and does not file an IRS 990. However, the **University of Delaware Foundation** (a separate nonprofit) does file a 990, which is the source for executive compensation (Schedule J) and philanthropic data.
- **UNITID** is the canonical institution identifier (from IPEDS). All cross-source linking flows through UNITID.
- **EIN** links to IRS 990 filings. For the first iteration, only UD Foundation EIN(s) are needed. A broader UNITID-to-EIN crosswalk will be built when expanding to peer institutions.
- IRS 990 XML schemas change across tax years. Use the Master Concordance File or the IRSx library to handle XPath variations.
- IPEDS variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.
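
One way to picture the XPath-variation problem noted in the list above is a lookup that tries several candidate paths for the same logical field and returns the first that matches. The sketch below uses only the standard library. The element names, the `CANDIDATE_PATHS` table, and the helper names are hypothetical placeholders rather than entries from the concordance file, and the toy documents omit the XML namespace that real filings declare.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Sequence

# Hypothetical mapping: one canonical field, several schema-specific paths.
# These element names are placeholders, not real IRS schema elements.
CANDIDATE_PATHS: Dict[str, Sequence[str]] = {
    "total_revenue": [
        "./ReturnData/IRS990/TotalRevenueNew",
        "./ReturnData/IRS990/TotalRevenueOld",
    ],
}


def find_first(root: ET.Element, paths: Sequence[str]) -> Optional[str]:
    """Return the text of the first path that resolves to a non-empty element."""
    for path in paths:
        node = root.find(path)
        if node is not None and node.text:
            return node.text.strip()
    return None


def extract(root: ET.Element, concordance: Dict[str, Sequence[str]]) -> Dict[str, Optional[str]]:
    """Resolve every canonical field against one parsed filing."""
    return {field: find_first(root, paths) for field, paths in concordance.items()}


# Two toy filings that name the same field differently.
old_filing = ET.fromstring(
    "<Return><ReturnData><IRS990>"
    "<TotalRevenueOld>100</TotalRevenueOld>"
    "</IRS990></ReturnData></Return>"
)
new_filing = ET.fromstring(
    "<Return><ReturnData><IRS990>"
    "<TotalRevenueNew>250</TotalRevenueNew>"
    "</IRS990></ReturnData></Return>"
)

print(extract(old_filing, CANDIDATE_PATHS))
print(extract(new_filing, CANDIDATE_PATHS))
```

A field that matches none of its candidate paths comes back as `None`, which leaves it to the caller to decide whether a missing value is an error or simply absent from that schema version.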

## Planned project structure

```
src/admin_analytics/
    config.py
    cli.py
    db/            # DuckDB schema and connection
    irs990/        # 990 download, XML parsing, Schedule J extraction, university filtering
    ipeds/         # IPEDS download, dictionary parsing, finance/HR/enrollment loading
    bls/           # CPI-U fetcher and loader
    scraper/       # Stretch: admin office headcount scraper
data/raw/          # Downloaded files (gitignored)
tests/
    fixtures/      # Sample XML/CSV files for tests
```

## Build & run

Not yet implemented. When built, the CLI will support:

```
admin-analytics ingest ipeds --year-range 2005-2024
admin-analytics ingest irs990 --year-range 2005-2024
admin-analytics ingest cpi
admin-analytics ingest all
```
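
Since the CLI is not built yet, the following is only a sketch of how a `--year-range` value such as `2005-2024` might be parsed, independent of whether typer or click is chosen. The function name `parse_year_range` is hypothetical, and treating the end year as inclusive is an assumption.

```python
def parse_year_range(value: str) -> range:
    """Parse 'START-END' into a range of years, treating END as inclusive."""
    start_text, separator, end_text = value.partition("-")
    if not separator:
        raise ValueError(f"expected START-END, got {value!r}")
    # int() raises ValueError on empty or non-numeric parts.
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start year {start} is after end year {end}")
    return range(start, end + 1)


years = parse_year_range("2005-2024")
print(years[0], years[-1], len(years))
```

Returning a `range` keeps the result cheap to pass around and lets each ingest step loop over years directly.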

## Conventions

- Raw data tables are prefixed with `raw_` (e.g., `raw_institution`, `raw_990_schedule_j`).
- Downloaded files go in `data/raw/` and are gitignored.
- IPEDS variables are mapped to canonical column names at ingest time; raw CSVs stay on disk for reprocessing.
- First iteration filters all data to UD/UD Foundation only. Design parsers to accept institution filters so they can scale to multiple institutions in a later iteration.
- 990 downloads are filtered by EIN from index files to avoid downloading the full archive (hundreds of GB).
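
The EIN-based filtering described in the last convention could look roughly like the sketch below, which scans an index file and keeps only rows for the target EINs. The column names (`EIN`, `TAX_PERIOD`, `OBJECT_ID`), the sample rows, and the function name are assumptions for illustration, and the EINs are dummy values, not the UD Foundation's.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, TextIO


def filter_index_by_ein(
    index_file: TextIO,
    target_eins: Iterable[str],
    ein_column: str = "EIN",  # assumed column name
) -> Iterator[Dict[str, str]]:
    """Yield index rows whose EIN is in the target set."""
    # Normalize to nine digits without hyphens so '00-0000001' matches '000000001'.
    wanted = {ein.replace("-", "").zfill(9) for ein in target_eins}
    for row in csv.DictReader(index_file):
        if row[ein_column].strip().zfill(9) in wanted:
            yield row


# Dummy index content; real index files are much larger.
SAMPLE_INDEX = """EIN,TAX_PERIOD,OBJECT_ID
000000001,202006,OBJ-A
000000002,202006,OBJ-B
000000001,202106,OBJ-C
"""

matches = list(filter_index_by_ein(io.StringIO(SAMPLE_INDEX), ["00-0000001"]))
print([row["OBJECT_ID"] for row in matches])
```

Because the function takes a set of EINs rather than a single value, the same code path would serve a later multi-institution iteration without changes.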