# Phase 1 — Data Acquisition: Implementation Plan

**Version:** 0.1 | **Status:** Draft | **Date:** March 2026

---

## Summary

Phase 1 builds the data ingestion layer for three primary sources (IRS 990 XML, IPEDS CSV, BLS CPI-U) and a stretch-goal web scraper, scoped to the **University of Delaware** only. Peer/AAU/multi-institution comparisons are deferred to a later iteration. The deliverable is a populated DuckDB database with UD's raw data ready for Phase 2 normalization.

---

## Data Source Details

### IRS 990 Bulk XML

**Location:** `https://apps.irs.gov/pub/epostcard/990/xml/{YEAR}/`

**Key extractions:**

- Filing header: EIN, tax year, org name, total revenue/expenses/assets
- Part VII: per-person compensation summary (name, title, hours, reportable comp)
- **Schedule J:** detailed compensation breakdown per person — base, bonus, other, deferred, nontaxable benefits, total

**Schema considerations:**

- XML element names vary across tax years. The [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/) maps field variations.
- [IRSx (990-xml-reader)](https://github.com/jsfenfen/990-xml-reader) handles versioned XPath differences — evaluate it first, with a custom lxml parser as the fallback.
- Only private/nonprofit universities file 990s. UD is a public university and does **not** file a 990. However, the University of Delaware Foundation (a separate nonprofit) does — it is the source for UD executive compensation and philanthropic data.
- For the first iteration, filter to UD/UD Foundation EINs only. Broader higher-ed filtering (NTEE codes B40-B43, EIN crosswalk) is deferred to the multi-institution iteration.
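The Schedule J extraction above can be sketched as follows. This is a minimal illustration, not the final parser: the element names (`RltdOrgOfficerTrstKeyEmplGrp`, `BaseCompensationFilingOrgAmt`, etc.) follow one recent e-file schema version and **vary by tax year**, which is exactly why the plan calls for IRSx or the Master Concordance File; the sample XML fragment is fabricated. The sketch uses stdlib `ElementTree` for self-containment — `lxml` exposes the same `find`/`findall` API plus richer XPath.

```python
# Illustrative Schedule J parser (task 2.5). Element names are from one
# schema version only -- real code must consult IRSx / the concordance file.
import xml.etree.ElementTree as ET

NS = {"efile": "http://www.irs.gov/efile"}  # namespace used by e-filed 990 XML


def parse_schedule_j(xml_text: str) -> list[dict]:
    """Extract one row per officer/key employee from Schedule J, Part II."""
    root = ET.fromstring(xml_text)
    rows = []
    for grp in root.findall(".//efile:RltdOrgOfficerTrstKeyEmplGrp", NS):
        def amt(tag: str) -> int:
            # Compensation elements are omitted when zero; default to 0.
            el = grp.find(f"efile:{tag}", NS)
            return int(el.text) if el is not None and el.text else 0

        name_el = grp.find("efile:PersonNm", NS)
        title_el = grp.find("efile:TitleTxt", NS)
        rows.append({
            "person_name": name_el.text if name_el is not None else None,
            "title": title_el.text if title_el is not None else None,
            "base_compensation": amt("BaseCompensationFilingOrgAmt"),
            "bonus_compensation": amt("BonusFilingOrganizationAmt"),
            "other_compensation": amt("OtherCompensationFilingOrgAmt"),
        })
    return rows


# Fabricated minimal fragment (not a real filing):
SAMPLE = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnData><IRS990ScheduleJ>
    <RltdOrgOfficerTrstKeyEmplGrp>
      <PersonNm>JANE DOE</PersonNm>
      <TitleTxt>PRESIDENT</TitleTxt>
      <BaseCompensationFilingOrgAmt>500000</BaseCompensationFilingOrgAmt>
      <BonusFilingOrganizationAmt>50000</BonusFilingOrganizationAmt>
    </RltdOrgOfficerTrstKeyEmplGrp>
  </IRS990ScheduleJ></ReturnData>
</Return>"""

rows = parse_schedule_j(SAMPLE)
```

The default-namespace handling is the part most worth keeping: every element in an e-filed 990 lives under `http://www.irs.gov/efile`, so unprefixed `findall` calls silently return nothing.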
### IPEDS Bulk CSV

**Location:** `https://nces.ed.gov/ipeds/datacenter/` — Complete Data Files

**Key survey components:**

| Component | What it provides |
|-----------|------------------|
| HD (Directory) | UNITID, name, EIN, Carnegie classification, sector, control |
| F1A / F2 (Finance) | Expenses by function — instruction, research, academic support, **institutional support (admin)**, etc. |
| S / SAL (HR) | Staff counts by occupational category, faculty counts, salary outlays |
| EF (Enrollment) | Student headcounts for per-student cost calculations |

**Note:** Variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.

### BLS CPI-U

**Series ID:** `CUUR0000SA0` — CPI-U, Not Seasonally Adjusted, U.S. City Average, All Items

**Preferred method:** Flat-file download from `https://download.bls.gov/pub/time.series/cu/cu.data.0.Current` (simpler than the API: no rate limits, full history in one file).

**Alternative:** BLS API v2 at `https://api.bls.gov/publicAPI/v2/timeseries/data/` (requires a free registration key).

### Stretch: Admin Office Web Pages

The first iteration targets the **University of Delaware only**. Curate URLs for UD's Office of the President, Provost, VP units, and college admin pages. Use `requests` + `BeautifulSoup` to extract staff directory headcounts. Expand to 20-30 peer institutions in a later iteration.

---

## Sprint Plan

### Sprint 1 (Weeks 1-2): Foundation + IPEDS

IPEDS goes first — it provides UD's UNITID and institutional context. Downloads pull the full national files, but loading is filtered to **UD's UNITID only**.
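The UNITID filter described above can be sketched in a few lines. Two assumptions are labeled in the code: UD's UNITID is taken to be `130943` (verify against the HD directory file before hardcoding), and because IPEDS header casing varies across years (`UNITID` vs. `unitid`), the column lookup is case-insensitive — a small instance of the "never hardcode variable names" rule.

```python
# Sketch of the Sprint 1 UNITID filter. UD's UNITID is ASSUMED to be 130943
# here -- confirm it against the HD directory file. IPEDS header case varies
# by year, so the unitid column is located case-insensitively.
import csv
import io

UD_UNITID = "130943"  # assumption; verify from the HD file


def filter_to_ud(csv_text: str) -> list[dict]:
    """Return only the rows belonging to UD from an IPEDS CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Locate the unitid column regardless of header casing.
    unitid_col = next(c for c in reader.fieldnames if c.lower() == "unitid")
    return [row for row in reader if row[unitid_col].strip() == UD_UNITID]


# Tiny fabricated HD-style extract for illustration:
SAMPLE = """UNITID,INSTNM,STABBR
130943,University of Delaware,DE
104151,Arizona State University,AZ
"""

ud_rows = filter_to_ud(SAMPLE)
```

In the real loaders the same filter would be applied while streaming each yearly file, so only UD's rows ever reach DuckDB while the full raw downloads stay on disk.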
| Task | Description |
|------|-------------|
| 1.1 | Project scaffolding: `pyproject.toml`, directory structure, DuckDB setup, CLI skeleton |
| 1.2 | IPEDS Downloader: download HD complete data files for 2005-2024, parse data dictionaries |
| 1.3 | IPEDS Institution Loader: parse HD files, filter to UD's UNITID, load `raw_institution` |
| 1.4 | IPEDS Finance Parser: download and parse F1A (GASB — UD is public) finance files for UD, map year-varying variable names to canonical columns |
| 1.5 | IPEDS HR Parser: download and parse Fall Staff / Salaries survey files for UD |
| 1.6 | IPEDS Enrollment: parse EF for UD total headcount |
| 1.7 | Tests for all IPEDS parsers using fixture files |

### Sprint 2 (Weeks 3-4): IRS 990

| Task | Description |
|------|-------------|
| 2.1 | Identify UD-related EINs: University of Delaware Foundation and any other UD-affiliated nonprofit entities that file 990s |
| 2.2 | 990 Downloader: download yearly index CSVs, filter to UD-related EINs only, download matching XML files |
| 2.3 | 990 XML Parser — Filing Header: extract EIN, tax year, org name, return type, financials |
| 2.4 | 990 XML Parser — Part VII: extract per-person compensation summaries, handle XPath variations |
| 2.5 | 990 XML Parser — Schedule J: extract detailed compensation breakdowns (base, bonus, other, deferred, nontaxable, total) |
| 2.6 | Title Normalization: regex patterns + manual mapping to bucket titles into canonical roles (President, Provost, VP, Dean, CFO, etc.) |
| 2.7 | Tests with fixture XML files covering different schema years |

### Sprint 3 (Week 5): BLS CPI-U + Integration

| Task | Description |
|------|-------------|
| 3.1 | BLS CPI-U Fetcher: flat-file download of `cu.data.0.Current`, load `raw_cpi_u` |
| 3.2 | CLI wiring: `ingest ipeds`, `ingest irs990`, `ingest cpi`, `ingest all` with `--year-range` and `--force-redownload` flags |
| 3.3 | Data validation queries: row counts, NULL rates, year coverage, cross-source consistency |
| 3.4 | Data dictionary documentation |

### Sprint 4 (Week 6): Stretch — Admin Page Scraper

| Task | Description |
|------|-------------|
| 4.1 | Curate UD admin office URLs: Office of the President, Provost, VP units, college dean pages |
| 4.2 | Scraper prototype: `requests` + `BeautifulSoup` for UD staff directory pages |
| 4.3 | Headcount extraction: parse pages for staff names/counts, load `raw_admin_headcount` |
| 4.4 | Document limitations and accuracy |

---

## Dependency Graph

```
Sprint 1: IPEDS ──────────────────┐
  (provides UNITID-EIN crosswalk) │
                                  ▼
Sprint 2: IRS 990 ────────────────┐
  (uses EIN crosswalk to filter)  │
                                  │
Sprint 3: BLS CPI-U (independent) │
  + CLI + Validation ─────────────┤
                                  ▼
Sprint 4: Stretch scraper ────── Phase 1 Complete
                                  │
        Phase 2: SKIPPED (folded into dashboard queries)
                                  │
Sprint 5: Dash dashboard ──────── Phase 3 Prototype
```

---

## Phase 2 Skip Decision (March 2026)

Phase 2 (data pipeline & normalization) was **skipped** for the initial prototype. All derived metrics — admin cost ratios, CPI adjustments, compensation growth indices — are computed directly in dashboard SQL queries rather than in a separate normalized schema.

**Rationale:** With a single institution (UD) and a populated DuckDB, the query layer is sufficient for a local prototype.
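As one sketch of what "derived metrics in dashboard SQL" looks like, the query below computes the institutional-support share of total expenses and restates admin spending in latest-year dollars using the CPI-U annual average. Table and column names come from the planned raw schema; the annual CPI is approximated as the mean of the stored monthly values, and the exact query shape is illustrative, not final.

```sql
-- Illustrative dashboard-layer metric (no Phase 2 schema): admin share of
-- expenses, plus admin spending restated in latest-year dollars via CPI-U.
WITH cpi_annual AS (
    SELECT year, AVG(value) AS cpi      -- annual average of monthly CPI-U
    FROM raw_cpi_u
    GROUP BY year
),
latest AS (
    SELECT cpi FROM cpi_annual ORDER BY year DESC LIMIT 1
)
SELECT
    f.year,
    f.institutional_support_expenses / f.total_expenses    AS admin_share,
    f.institutional_support_expenses * latest.cpi / c.cpi  AS admin_real_dollars
FROM raw_ipeds_finance AS f
JOIN cpi_annual AS c USING (year)
CROSS JOIN latest
ORDER BY f.year;
```

Queries of this shape are exactly the cross-source joins that would migrate into materialized views once a proper Phase 2 exists.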
A proper Phase 2 with dbt transformations and a unified analytical schema should be built before:

- Expanding to multi-institution comparisons (Phase 4)
- Moving to a production React dashboard (Phase 3 full build)
- Adding complex cross-source joins that benefit from materialized views

---

## Technical Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Database | DuckDB | Zero-config, fast analytical queries, native CSV/Parquet, single-file portable. Migrate to PostgreSQL in Phase 3. |
| Package manager | uv | Fast, modern, handles virtualenvs and lockfiles. |
| 990 XML parsing | Evaluate IRSx first; fall back to custom lxml + Master Concordance File | IRSx handles schema versioning. If unmaintained or incomplete for Schedule J, build custom. |
| IPEDS variable mapping | Map to canonical names at ingest | Keeps the DB schema stable across years. Raw files stay on disk. |
| 990 download strategy | Filter to UD Foundation EIN(s) from index files | First iteration is a single institution; broader filtering deferred. |
| BLS data | Flat-file download | Simpler than the API, no rate limits, single file covers full history. |
| Admin scraper | requests + BeautifulSoup | Scrapy is overkill for 20-30 targeted sites. |

---

## Database Schema

### raw_institution (from IPEDS HD)

- `unitid` (PK), `ein`, `institution_name`, `city`, `state`, `sector`, `control`, `carnegie_class`, `enrollment_total`, `year`

### raw_990_filing (IRS 990 header)

- `object_id` (PK), `ein`, `tax_year`, `organization_name`, `return_type`, `filing_date`, `total_revenue`, `total_expenses`, `total_assets`

### raw_990_schedule_j (one row per person per filing)

- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `base_compensation`, `bonus_compensation`, `other_compensation`, `deferred_compensation`, `nontaxable_benefits`, `total_compensation`, `compensation_from_related`

### raw_990_part_vii (one row per person per filing)

- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `avg_hours_per_week`, `reportable_comp_from_org`, `reportable_comp_from_related`, `other_compensation`

### raw_ipeds_finance (one row per institution per year)

- `unitid`, `year` (composite PK), `reporting_standard`, `total_expenses`, `instruction_expenses`, `research_expenses`, `public_service_expenses`, `academic_support_expenses`, `student_services_expenses`, `institutional_support_expenses`, `auxiliary_expenses`, `hospital_expenses`, `other_expenses`, `salaries_wages`, `benefits`

### raw_ipeds_staff

- `unitid`, `year` (composite PK), `total_staff`, `faculty_total`, `management_total`

### raw_cpi_u

- `year`, `month` (composite PK), `value`, `series_id`

### raw_admin_headcount (stretch)

- `id` (PK), `unitid`, `institution_name`, `admin_unit`, `page_url`, `scrape_date`, `staff_count`, `staff_names`

---

## Key Libraries

- `lxml` — XML parsing for 990 filings
- `duckdb` — database engine
- `httpx` or `requests` — HTTP downloads and BLS API
- `polars` or `pandas` — CSV processing for IPEDS
- `typer` or `click` — CLI framework
- `beautifulsoup4` — stretch-goal scraper
- `irsx` — evaluate for 990 XML parsing
- `pytest` — testing

---

## Risks

| Risk | Mitigation |
|------|------------|
| IRS XML schema variations break parsing | Use the Master Concordance File or IRSx. Test fixtures from multiple schema years. |
| IPEDS variable names change year to year | Always parse the data dictionary alongside each file. |
| Bulk 990 download too large | First iteration is UD only — minimal download. Design for broader filtering in a later iteration. |
| IRSx library unmaintained | Evaluate early in Sprint 2. Fallback: custom lxml parser with concordance mappings. |
| Title normalization unreliable | Start with high-confidence exact/regex matches. Flag ambiguous titles for manual review. |

---

## References

- [IRS Form 990 Series Downloads](https://www.irs.gov/charities-non-profits/form-990-series-downloads)
- [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/)
- [IRSx 990-xml-reader](https://github.com/jsfenfen/990-xml-reader)
- [Form 990 XML Schema Mapper](https://github.com/Giving-Tuesday/form-990-xml-mapper)
- [Schedule J Instructions](https://www.irs.gov/instructions/i990sj)
- [IPEDS Data Center](https://nces.ed.gov/ipeds/use-the-data)
- [Urban Institute IPEDS Scraper](https://github.com/UrbanInstitute/ipeds-scraper)
- [BLS API v2](https://www.bls.gov/developers/api_signature_v2.htm)
- [BLS CPI Series IDs](https://www.bls.gov/cpi/factsheets/cpi-series-ids.htm)

---

*Generated by Claude · Administrative Analytics Project*