Eric f037c50736 Initial project planning docs for UD administrative analytics

- Project scope document (v0.1): objectives, data sources, key metrics, phases
- Phase 1 implementation plan: IPEDS, IRS 990, BLS CPI-U acquisition for UD
- CLAUDE.md: project context and conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-29 18:28:30 -04:00


Phase 1 — Data Acquisition: Implementation Plan

Version: 0.1 | Status: Draft | Date: March 2026

Summary

Phase 1 builds the data ingestion layer for three primary sources (IRS 990 XML, IPEDS CSV, BLS CPI-U) and a stretch-goal web scraper, scoped to the University of Delaware only. Peer/AAU/multi-institution comparisons are deferred to a later iteration. The deliverable is a populated DuckDB database with UD's raw data ready for Phase 2 normalization.

Data Source Details

IRS 990 Bulk XML

Location: https://apps.irs.gov/pub/epostcard/990/xml/{YEAR}/

Key extractions:

Filing header: EIN, tax year, org name, total revenue/expenses/assets
Part VII: per-person compensation summary (name, title, hours, reportable comp)
Schedule J: detailed compensation breakdown per person — base, bonus, other, deferred, nontaxable benefits, total
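
A minimal sketch of the filing-header extraction, assuming the e-file namespace `http://www.irs.gov/efile` and the element names listed in `HEADER_PATHS`. Those names are assumptions to confirm against the Master Concordance File, and the sample filing is invented. The standard-library ElementTree is used so the sketch runs without extra dependencies; `lxml.etree` offers the same `findtext` call.

```python
import xml.etree.ElementTree as ET

NS = {"irs": "http://www.irs.gov/efile"}

# Candidate paths per field, newer schema first. Element names are assumptions
# to verify against the Master Concordance File before relying on them.
HEADER_PATHS = {
    "ein": ["irs:ReturnHeader/irs:Filer/irs:EIN"],
    "tax_year": ["irs:ReturnHeader/irs:TaxYr", "irs:ReturnHeader/irs:TaxYear"],
    "organization_name": [
        "irs:ReturnHeader/irs:Filer/irs:BusinessName/irs:BusinessNameLine1Txt",
        "irs:ReturnHeader/irs:Filer/irs:BusinessName/irs:BusinessNameLine1",
    ],
    "total_revenue": [
        "irs:ReturnData/irs:IRS990/irs:CYTotalRevenueAmt",
        "irs:ReturnData/irs:IRS990/irs:TotalRevenueCurrentYear",
    ],
    "total_expenses": [
        "irs:ReturnData/irs:IRS990/irs:CYTotalExpensesAmt",
        "irs:ReturnData/irs:IRS990/irs:TotalExpensesCurrentYear",
    ],
    "total_assets": [
        "irs:ReturnData/irs:IRS990/irs:TotalAssetsEOYAmt",
        "irs:ReturnData/irs:IRS990/irs:TotalAssetsEOY",
    ],
}


def first_text(root, paths):
    """Return the stripped text of the first path that exists, else None."""
    for path in paths:
        value = root.findtext(path, namespaces=NS)
        if value is not None:
            return value.strip()
    return None


def parse_filing_header(xml_text):
    """Extract header fields from one 990 XML document given as a string."""
    root = ET.fromstring(xml_text)
    return {field: first_text(root, paths) for field, paths in HEADER_PATHS.items()}


# Invented sample filing with a placeholder EIN.
SAMPLE_NEW = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYr>2013</TaxYr>
    <Filer>
      <EIN>000000000</EIN>
      <BusinessName><BusinessNameLine1Txt>EXAMPLE FOUNDATION</BusinessNameLine1Txt></BusinessName>
    </Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990><CYTotalRevenueAmt>12345</CYTotalRevenueAmt></IRS990>
  </ReturnData>
</Return>"""

header = parse_filing_header(SAMPLE_NEW)
print(header)
```

Fields with no matching path come back as `None`, which makes schema gaps visible in the NULL-rate checks rather than raising mid-parse.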

Schema considerations:

XML element names vary across tax years. The Master Concordance File maps field variations.
IRSx (990-xml-reader) handles versioned XPath differences. Evaluate it first, and fall back to a custom lxml parser if it proves unmaintained or incomplete.
Only private/nonprofit universities file 990s. UD is a public university and does not file a 990. However, the University of Delaware Foundation (a separate nonprofit) does file a 990 — this is the source for UD executive compensation and philanthropic data.
For the first iteration, filter to UD/UD Foundation EINs only. Broader higher-ed filtering (NTEE codes B40-B43, EIN crosswalk) is deferred to the multi-institution iteration.
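
The EIN filter could be sketched as below. It assumes the yearly index CSV has `EIN` and `OBJECT_ID` columns (an assumption to check against an actual index file); the function name, sample rows, and EINs are placeholders, not real identifiers.

```python
import csv
import io


def matching_filings(index_csv_text, target_eins):
    """Yield index rows whose EIN is in target_eins, compared as 9-digit strings."""
    wanted = {ein.replace("-", "").zfill(9) for ein in target_eins}
    for row in csv.DictReader(io.StringIO(index_csv_text)):
        if row["EIN"].strip().zfill(9) in wanted:
            yield row


# Invented index excerpt; column layout is an assumption, EINs are placeholders.
SAMPLE_INDEX = (
    "RETURN_ID,FILING_TYPE,EIN,TAX_PERIOD,SUB_DATE,TAXPAYER_NAME,RETURN_TYPE,DLN,OBJECT_ID\n"
    "1,EFILE,000000001,202006,2021,EXAMPLE FOUNDATION,990,D1,OBJ0001\n"
    "2,EFILE,000000002,202006,2021,UNRELATED ORG,990,D2,OBJ0002\n"
)

rows = list(matching_filings(SAMPLE_INDEX, ["00-0000001"]))
print([row["OBJECT_ID"] for row in rows])
```

Normalizing both sides to nine digits guards against hyphenated EINs in a hand-curated list and against leading zeros lost in a spreadsheet round trip.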

IPEDS Bulk CSV

Location: https://nces.ed.gov/ipeds/datacenter/ — Complete Data Files

Key survey components:

| Component | What it provides |
| --- | --- |
| HD (Directory) | UNITID, name, EIN, Carnegie classification, sector, control |
| F1A / F2 (Finance) | Expenses by function: instruction, research, academic support, institutional support (admin), etc. |
| S / SAL (HR) | Staff counts by occupational category, faculty counts, salary outlays |
| EF (Enrollment) | Student headcounts for per-student cost calculations |

Note: Variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.
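
One way to follow that rule is to derive the variable map from each year's dictionary labels and only then rename columns. The sketch below is illustrative: the label keywords, variable names (`X1`, `X2`, `X3`), and UNITID `111111` are hypothetical, and real dictionaries will need stricter matching than a substring test.

```python
import csv
import io

# Canonical column -> keyword expected in the dictionary label (hypothetical).
CANONICAL_LABELS = {
    "instruction_expenses": "instruction",
    "institutional_support_expenses": "institutional support",
}


def build_variable_map(dictionary_rows):
    """Map source variable name -> canonical column by matching dictionary labels."""
    mapping = {}
    for var_name, var_title in dictionary_rows:
        title = var_title.lower()
        for canonical, keyword in CANONICAL_LABELS.items():
            if keyword in title:
                mapping[var_name.upper()] = canonical
    return mapping


def load_rows(csv_text, variable_map, unitid):
    """Keep only the requested UNITID and rename columns to canonical names."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["UNITID"].strip() != str(unitid):
            continue
        record = {"unitid": int(row["UNITID"])}
        for source_name, canonical in variable_map.items():
            record[canonical] = float(row[source_name])
        records.append(record)
    return records


# Invented dictionary rows and data for one survey year.
dictionary_rows = [
    ("x1", "Instruction - Total amount"),
    ("x2", "Institutional support - Total amount"),
    ("x3", "Research - Total amount"),
]
data_csv = "UNITID,X1,X2,X3\n111111,100,20,5\n222222,300,40,9\n"

variable_map = build_variable_map(dictionary_rows)
records = load_rows(data_csv, variable_map, 111111)
print(variable_map)
print(records)
```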

BLS CPI-U

Series ID: CUUR0000SA0 — CPI-U, Not Seasonally Adjusted, U.S. City Average, All Items

Preferred method: Flat-file download from https://download.bls.gov/pub/time.series/cu/cu.data.0.Current (simpler than API, no rate limits, full history in one file).

Alternative: BLS API v2 at https://api.bls.gov/publicAPI/v2/timeseries/data/ (requires free registration key).
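
A sketch of parsing the flat file, assuming it is tab-delimited with space-padded columns `series_id`, `year`, `period`, `value`, `footnote_codes`, and that period `M13` marks the annual average. The sample rows and their values are invented for illustration.

```python
import csv
import io

SERIES_ID = "CUUR0000SA0"


def parse_cpi_flat_file(text, series_id=SERIES_ID):
    """Return monthly observations for one series as dicts shaped like raw_cpi_u."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader, None)  # skip the header row
    for fields in reader:
        if len(fields) < 4:
            continue
        sid, year, period, value = (field.strip() for field in fields[:4])
        # Keep monthly periods M01..M12 only; M13 is assumed to be the annual average.
        if sid != series_id or not period.startswith("M") or period == "M13":
            continue
        rows.append(
            {"series_id": sid, "year": int(year), "month": int(period[1:]), "value": float(value)}
        )
    return rows


# Invented sample in the assumed layout (values are not real CPI figures).
SAMPLE = (
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "CUUR0000SA0      \t2020\tM01\t     100.000\t\n"
    "CUUR0000SA0      \t2020\tM02\t     100.500\t\n"
    "CUUR0000SA0      \t2020\tM13\t     100.250\t\n"
    "CUUR0000SA0E     \t2020\tM01\t     200.000\t\n"
)

cpi_rows = parse_cpi_flat_file(SAMPLE)
print(cpi_rows)
```

Stripping each field matters because the series ID is padded, so an unstripped comparison against `CUUR0000SA0` would match nothing.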

Stretch: Admin Office Web Pages

First iteration targets University of Delaware only. Curate URLs for UD's Office of the President, Provost, VP units, and college admin pages. Use requests + BeautifulSoup to extract staff directory headcounts. Expand to 20-30 peer institutions in a later iteration.

Sprint Plan

Sprint 1 (Weeks 1-2): Foundation + IPEDS

IPEDS goes first because it provides UD's UNITID and institutional context. Downloads pull the full national files, but loading is filtered to UD's UNITID.

| Task | Description |
| --- | --- |
| 1.1 | Project scaffolding: `pyproject.toml`, directory structure, DuckDB setup, CLI skeleton |
| 1.2 | IPEDS Downloader: download HD complete data files for 2005-2024, parse data dictionaries |
| 1.3 | IPEDS Institution Loader: parse HD files, filter to UD's UNITID, load `raw_institution` |
| 1.4 | IPEDS Finance Parser: download and parse F1A (GASB, since UD is public) finance files for UD, map year-varying variable names to canonical columns |
| 1.5 | IPEDS HR Parser: download and parse Fall Staff / Salaries survey files for UD |
| 1.6 | IPEDS Enrollment: parse EF for UD total headcount |
| 1.7 | Tests for all IPEDS parsers using fixture files |

Sprint 2 (Weeks 3-4): IRS 990

| Task | Description |
| --- | --- |
| 2.1 | Identify UD-related EINs: University of Delaware Foundation and any other UD-affiliated nonprofit entities that file 990s |
| 2.2 | 990 Downloader: download yearly index CSVs, filter to UD-related EINs only, download matching XML files |
| 2.3 | 990 XML Parser (Filing Header): extract EIN, tax year, org name, return type, financials |
| 2.4 | 990 XML Parser (Part VII): extract per-person compensation summaries, handle XPath variations |
| 2.5 | 990 XML Parser (Schedule J): extract detailed compensation breakdowns (base, bonus, other, deferred, nontaxable, total) |
| 2.6 | Title Normalization: regex patterns + manual mapping to bucket titles into canonical roles (President, Provost, VP, Dean, CFO, etc.) |
| 2.7 | Tests with fixture XML files covering different schema years |
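
Task 2.6 could start from an ordered list of patterns, as in the sketch below. The patterns, their order, and the function name are assumptions for illustration; titles that match nothing return `None` so they can be routed to manual review.

```python
import re

# Ordered from most to least specific: the first matching pattern wins, so
# "Executive Vice President & Provost" buckets as Provost rather than VP.
ROLE_PATTERNS = [
    ("Provost", re.compile(r"\bprovost\b")),
    ("CFO", re.compile(r"\b(cfo|chief financial officer)\b")),
    ("VP", re.compile(r"\b(vp|vice president)\b")),
    ("President", re.compile(r"\bpresident\b")),
    ("Dean", re.compile(r"\bdean\b")),
]


def normalize_title(raw_title):
    """Return a canonical role for a 990 title string, or None if ambiguous."""
    title = raw_title.lower()
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(title):
            return role
    return None


for sample in ["PRESIDENT", "Executive Vice President & Provost", "Trustee"]:
    print(sample, "->", normalize_title(sample))
```

Pattern order encodes a policy decision (for example, whether a "Vice Provost" counts as Provost), so the list is worth reviewing against real filings before the buckets feed any analysis.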

Sprint 3 (Week 5): BLS CPI-U + Integration

| Task | Description |
| --- | --- |
| 3.1 | BLS CPI-U Fetcher: flat-file download of `cu.data.0.Current`, load `raw_cpi_u` |
| 3.2 | CLI wiring: `ingest ipeds`, `ingest irs990`, `ingest cpi`, `ingest all` with `--year-range` and `--force-redownload` flags |
| 3.3 | Data validation queries: row counts, NULL rates, year coverage, cross-source consistency |
| 3.4 | Data dictionary documentation |
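
The command shape in task 3.2 can be sketched with the standard library's `argparse`, used here as a stand-in so the example runs without installing typer or click. The program name `udanalytics`, the `START-END` syntax, and the 2005-2024 default are assumptions.

```python
import argparse


def parse_year_range(text):
    """Parse 'START-END' into a (start, end) tuple of ints."""
    start, end = (int(part) for part in text.split("-"))
    if start > end:
        raise argparse.ArgumentTypeError(f"start year {start} is after end year {end}")
    return start, end


def build_parser():
    parser = argparse.ArgumentParser(prog="udanalytics")  # hypothetical program name
    commands = parser.add_subparsers(dest="command", required=True)
    ingest = commands.add_parser("ingest", help="download and load raw data")
    ingest.add_argument("source", choices=["ipeds", "irs990", "cpi", "all"])
    ingest.add_argument(
        "--year-range",
        type=parse_year_range,
        default=(2005, 2024),  # assumed default, mirroring the IPEDS download range
        metavar="START-END",
    )
    ingest.add_argument("--force-redownload", action="store_true")
    return parser


args = build_parser().parse_args(
    ["ingest", "ipeds", "--year-range", "2010-2012", "--force-redownload"]
)
print(args.source, args.year_range, args.force_redownload)
```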

Sprint 4 (Week 6): Stretch — Admin Page Scraper

| Task | Description |
| --- | --- |
| 4.1 | Curate UD admin office URLs: Office of the President, Provost, VP units, college dean pages |
| 4.2 | Scraper prototype: `requests` + `BeautifulSoup` for UD staff directory pages |
| 4.3 | Headcount extraction: parse pages for staff names/counts, load `raw_admin_headcount` |
| 4.4 | Document limitations and accuracy |

Dependency Graph

Sprint 1: IPEDS  ──────────────────┐
  (provides UNITID-EIN crosswalk)  │
                                    ▼
Sprint 2: IRS 990 ────────────────┐
  (uses EIN crosswalk to filter)  │
                                   │
Sprint 3: BLS CPI-U (independent) │
  + CLI + Validation ─────────────┤
                                   ▼
Sprint 4: Stretch scraper ────── Phase 1 Complete

Technical Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Database | DuckDB | Zero-config, fast analytical queries, native CSV/Parquet, single-file portable. Migrate to PostgreSQL in Phase 3. |
| Package manager | uv | Fast, modern, handles virtualenvs and lockfiles. |
| 990 XML parsing | Evaluate IRSx first, fall back to custom lxml + Master Concordance File | IRSx handles schema versioning. If unmaintained or incomplete for Schedule J, build custom. |
| IPEDS variable mapping | Map to canonical names at ingest | Keeps DB schema stable across years. Raw files stay on disk. |
| 990 download strategy | Filter to UD Foundation EIN(s) from index files | First iteration is a single institution; broader filtering deferred. |
| BLS data | Flat-file download | Simpler than API, no rate limits, single file covers full history. |
| Admin scraper | requests + BeautifulSoup | Scrapy is overkill for 20-30 targeted sites. |

Database Schema

raw_institution (from IPEDS HD)

unitid, year (composite PK, one row per HD survey year), ein, institution_name, city, state, sector, control, carnegie_class, enrollment_total

raw_990_filing (IRS 990 header)

object_id (PK), ein, tax_year, organization_name, return_type, filing_date, total_revenue, total_expenses, total_assets

raw_990_schedule_j (one row per person per filing)

id (PK), object_id (FK), ein, tax_year, person_name, title, base_compensation, bonus_compensation, other_compensation, deferred_compensation, nontaxable_benefits, total_compensation, compensation_from_related

raw_990_part_vii (one row per person per filing)

id (PK), object_id (FK), ein, tax_year, person_name, title, avg_hours_per_week, reportable_comp_from_org, reportable_comp_from_related, other_compensation

raw_ipeds_finance (one row per institution per year)

unitid, year (composite PK), reporting_standard, total_expenses, instruction_expenses, research_expenses, public_service_expenses, academic_support_expenses, student_services_expenses, institutional_support_expenses, auxiliary_expenses, hospital_expenses, other_expenses, salaries_wages, benefits

raw_ipeds_staff

unitid, year (composite PK), total_staff, faculty_total, management_total

raw_cpi_u

year, month (composite PK), value, series_id

raw_admin_headcount (stretch)

id (PK), unitid, institution_name, admin_unit, page_url, scrape_date, staff_count, staff_names
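
The row-count, NULL-rate, and year-coverage checks from task 3.3 might look like the sketch below for `raw_cpi_u`. It uses the standard library's `sqlite3` as a stand-in so it runs anywhere; the SQL is plain enough that it should carry over to DuckDB, though that is an assumption to verify. The function name and the sample rows are invented.

```python
import sqlite3


def validate_cpi(conn, first_year, last_year):
    """Report row count, NULL rate of value, and years missing from raw_cpi_u."""
    total, null_values = conn.execute(
        "SELECT COUNT(*), SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) FROM raw_cpi_u"
    ).fetchone()
    present = {
        row[0] for row in conn.execute("SELECT DISTINCT year FROM raw_cpi_u").fetchall()
    }
    missing = [year for year in range(first_year, last_year + 1) if year not in present]
    null_rate = (null_values or 0) / total if total else 0.0
    return {"row_count": total, "null_rate": null_rate, "missing_years": missing}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_cpi_u ("
    "year INTEGER, month INTEGER, value REAL, series_id TEXT, "
    "PRIMARY KEY (year, month))"
)
# Invented rows: one NULL value and a gap in 2021.
conn.executemany(
    "INSERT INTO raw_cpi_u VALUES (?, ?, ?, ?)",
    [
        (2020, 1, 100.0, "CUUR0000SA0"),
        (2020, 2, None, "CUUR0000SA0"),
        (2022, 1, 101.0, "CUUR0000SA0"),
    ],
)

report = validate_cpi(conn, 2020, 2022)
print(report)
```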

Key Libraries

lxml — XML parsing for 990 filings
duckdb — database engine
httpx or requests — HTTP downloads and BLS API
polars or pandas — CSV processing for IPEDS
typer or click — CLI framework
beautifulsoup4 — stretch goal scraper
irsx — evaluate for 990 XML parsing
pytest — testing

Risks

| Risk | Mitigation |
| --- | --- |
| IRS XML schema variations break parsing | Use Master Concordance File or IRSx. Test fixtures from multiple schema years. |
| IPEDS variable names change year to year | Always parse the data dictionary alongside each file. |
| Bulk 990 download too large | First iteration is UD only, so the download is minimal. Design for broader filtering in a later iteration. |
| IRSx library unmaintained | Evaluate early in Sprint 2. Fallback: custom lxml parser with concordance mappings. |
| Title normalization unreliable | Start with high-confidence exact/regex matches. Flag ambiguous titles for manual review. |


Generated by Claude · Administrative Analytics Project
