AdminAnalytics/CLAUDE.md
Eric f037c50736 Initial project planning docs for UD administrative analytics
- Project scope document (v0.1): objectives, data sources, key metrics, phases
- Phase 1 implementation plan: IPEDS, IRS 990, BLS CPI-U acquisition for UD
- CLAUDE.md: project context and conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 18:28:30 -04:00

Admin Analytics

University administrative cost benchmarking project using public data (IRS 990, IPEDS, BLS CPI-U). First iteration is scoped to the University of Delaware only. Peer/AAU/multi-institution comparisons are planned for a later iteration.

Project status

Currently in planning. Phase 1 (Data Acquisition) is planned but not yet built. See phase1_plan.md for the full implementation plan and administrative_analytics_scope_v0.1.md for project scope.

Architecture

  • Language: Python
  • Database: DuckDB for Phase 1 (single-file, zero-config). Migrate to PostgreSQL in Phase 3 when the dashboard needs concurrent access.
  • Package manager: uv
  • CLI framework: typer or click (TBD)
  • Testing: pytest

Data sources

Source                           | Format                  | What we extract
IRS 990 bulk XML                 | XML (versioned schemas) | Filing financials, Part VII compensation, Schedule J detailed compensation
IPEDS                            | CSV bulk downloads      | Institution directory (HD), finance by function (F1A/F2), staffing (S/SAL), enrollment (EF)
BLS CPI-U                        | Flat file or API        | Consumer Price Index for inflation-adjusted compensation analysis
Admin office web pages (stretch) | HTML scraping           | Staff directory headcounts
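
The CPI-U series is what makes compensation figures comparable across years. A minimal sketch of that adjustment, assuming one annual index value per year; the function name `to_real_dollars` and the index values are placeholders for illustration, not real BLS figures:

```python
def to_real_dollars(nominal: float, year: int, base_year: int, cpi: dict[int, float]) -> float:
    """Convert a nominal dollar amount from `year` into `base_year` dollars."""
    if year not in cpi or base_year not in cpi:
        raise KeyError(f"CPI value missing for {year} or {base_year}")
    # Scale by the ratio of the base-year index to the payment-year index.
    return nominal * cpi[base_year] / cpi[year]


# Placeholder index values for illustration only (NOT real CPI-U figures).
example_cpi = {2010: 200.0, 2020: 250.0}

# $400,000 paid in 2010, expressed in 2020 dollars under the placeholder index.
adjusted = to_real_dollars(400_000, 2010, 2020, example_cpi)
print(f"{adjusted:,.0f}")  # 500,000
```

Keeping the index as a plain year-to-value mapping means the same function works whether the values come from a flat file or the API.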

Key concepts

  • University of Delaware is the sole target institution for the first iteration. UD's IPEDS UNITID is the anchor for all IPEDS queries.
  • UD is a public university and does not file an IRS 990. However, the University of Delaware Foundation (a separate nonprofit) does file a 990; that filing is the source for executive compensation (Schedule J) and philanthropic data.
  • UNITID is the canonical institution identifier (from IPEDS). All cross-source linking flows through UNITID.
  • EIN links to IRS 990 filings. For the first iteration, only UD Foundation EIN(s) are needed. A broader UNITID-to-EIN crosswalk will be built when expanding to peer institutions.
  • IRS 990 XML schemas change across tax years. Use the Master Concordance File or the IRSx library to handle XPath variations.
  • IPEDS variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.
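
One way to cope with XPath variation across 990 schema versions is to try a list of candidate paths for each logical field and take the first that resolves. The sketch below uses only the standard library. The helper `first_match` is hypothetical, and the two element names are illustrative examples of a renamed field; a real path list would be generated from the concordance rather than typed by hand.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace used by IRS e-file return documents.
NS = {"irs": "http://www.irs.gov/efile"}

# Illustrative candidate paths for one logical field (total revenue).
# Assumption: the authoritative list comes from the concordance.
TOTAL_REVENUE_PATHS = [
    "irs:ReturnData/irs:IRS990/irs:CYTotalRevenueAmt",
    "irs:ReturnData/irs:IRS990/irs:TotalRevenueCurrentYear",
]


def first_match(root: ET.Element, paths: list[str]) -> Optional[str]:
    """Return the text of the first candidate path that resolves, or None."""
    for path in paths:
        node = root.find(path, NS)
        if node is not None and node.text:
            return node.text.strip()
    return None


# Tiny hand-written document standing in for a real filing.
sample = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnData><IRS990>
    <TotalRevenueCurrentYear>12345</TotalRevenueCurrentYear>
  </IRS990></ReturnData>
</Return>"""

root = ET.fromstring(sample)
print(first_match(root, TOTAL_REVENUE_PATHS))  # 12345
```

Returning `None` instead of raising lets the caller decide whether a missing field is an error or just absent in that tax year.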

Planned project structure

src/admin_analytics/
  config.py
  cli.py
  db/           # DuckDB schema and connection
  irs990/       # 990 download, XML parsing, Schedule J extraction, university filtering
  ipeds/        # IPEDS download, dictionary parsing, finance/HR/enrollment loading
  bls/          # CPI-U fetcher and loader
  scraper/      # Stretch: admin office headcount scraper
data/raw/       # Downloaded files (gitignored)
tests/
  fixtures/     # Sample XML/CSV files for tests

Build & run

Not yet implemented. When built, the CLI will support:

admin-analytics ingest ipeds --year-range 2005-2024
admin-analytics ingest irs990 --year-range 2005-2024
admin-analytics ingest cpi
admin-analytics ingest all
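
Since the CLI framework is still TBD, the following sketch uses the standard library's `argparse` as a stand-in to show how the `--year-range` option might be parsed. It assumes the range is inclusive on both ends, and `parse_year_range` is a hypothetical helper name:

```python
import argparse


def parse_year_range(text: str) -> range:
    """Parse 'START-END' into a range of years (assumed inclusive)."""
    start_s, sep, end_s = text.partition("-")
    if not sep or not start_s.isdigit() or not end_s.isdigit():
        raise argparse.ArgumentTypeError(f"expected START-END, got {text!r}")
    start, end = int(start_s), int(end_s)
    if start > end:
        raise argparse.ArgumentTypeError(f"start year {start} is after end year {end}")
    return range(start, end + 1)


# Stand-in parser mirroring the planned command shape.
parser = argparse.ArgumentParser(prog="admin-analytics")
sub = parser.add_subparsers(dest="command", required=True)
ingest = sub.add_parser("ingest")
ingest.add_argument("source", choices=["ipeds", "irs990", "cpi", "all"])
ingest.add_argument("--year-range", type=parse_year_range, default=None)

args = parser.parse_args(["ingest", "ipeds", "--year-range", "2005-2024"])
print(args.source, args.year_range[0], args.year_range[-1])  # ipeds 2005 2024
```

The same validation function could be reused as a typer or click callback once the framework is chosen.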

Conventions

  • Raw data tables are prefixed with raw_ (e.g., raw_institution, raw_990_schedule_j)
  • Downloaded files go in data/raw/ and are gitignored
  • IPEDS variables are mapped to canonical column names at ingest time; raw CSVs stay on disk for reprocessing
  • The first iteration filters all data to UD and the UD Foundation only. Design parsers to accept institution filters so they can scale to multiple institutions in a later iteration.
  • 990 downloads are filtered by EIN from index files to avoid downloading the full archive (hundreds of GB)
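
Two of the conventions above, canonical column names at ingest and institution filters on every parser, can be sketched together. This is a minimal illustration under stated assumptions: `load_rows`, the raw variable name `VAR_A`, the canonical name `total_expenses`, and the UNITID values are all hypothetical. In practice the per-year mapping would be built from that year's data dictionary.

```python
import csv
import io
from typing import Iterable, Iterator


def load_rows(
    lines: Iterable[str],
    column_map: dict[str, str],
    unitids: set[str],
) -> Iterator[dict[str, str]]:
    """Yield rows for the requested institutions, renamed to canonical columns."""
    for row in csv.DictReader(lines):
        # The institution filter is a parameter, not a hardcoded UD check.
        if row["UNITID"] not in unitids:
            continue
        yield {canonical: row[raw] for raw, canonical in column_map.items()}


# Hypothetical mapping for one survey year (raw name -> canonical name).
column_map_2020 = {"UNITID": "unitid", "VAR_A": "total_expenses"}

# Made-up sample data standing in for a raw IPEDS CSV.
sample_csv = io.StringIO("UNITID,VAR_A\n111111,500\n222222,900\n")

rows = list(load_rows(sample_csv, column_map_2020, unitids={"111111"}))
print(rows)  # [{'unitid': '111111', 'total_expenses': '500'}]
```

Passing a set of identifiers means expanding to peer institutions later only changes the argument, not the parser.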