Phase 1 — Data Acquisition: Implementation Plan
Version: 0.1 | Status: Draft | Date: March 2026
Summary
Phase 1 builds the data ingestion layer for three primary sources (IRS 990 XML, IPEDS CSV, BLS CPI-U) and a stretch-goal web scraper, scoped to the University of Delaware only. Peer/AAU/multi-institution comparisons are deferred to a later iteration. The deliverable is a populated DuckDB database with UD's raw data ready for Phase 2 normalization.
Data Source Details
IRS 990 Bulk XML
Location: https://apps.irs.gov/pub/epostcard/990/xml/{YEAR}/
Key extractions:
- Filing header: EIN, tax year, org name, total revenue/expenses/assets
- Part VII: per-person compensation summary (name, title, hours, reportable comp)
- Schedule J: detailed compensation breakdown per person — base, bonus, other, deferred, nontaxable benefits, total
Schema considerations:
- XML element names vary across tax years. The Master Concordance File maps field variations.
- IRSx (990-xml-reader) handles versioned XPath differences. Evaluate it first, and fall back to a custom lxml parser if it proves insufficient.
- Only private/nonprofit universities file 990s. UD is a public university and does not file a 990. However, the University of Delaware Foundation (a separate nonprofit) does file a 990 — this is the source for UD executive compensation and philanthropic data.
- For the first iteration, filter to UD/UD Foundation EINs only. Broader higher-ed filtering (NTEE codes B40-B43, EIN crosswalk) is deferred to the multi-institution iteration.
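The year-to-year element renames noted above could be handled by trying a list of candidate paths per canonical field. The following is a minimal sketch, not the project's parser: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the element names in `HEADER_PATHS` and the sample document are illustrative assumptions that should be checked against the Master Concordance File.

```python
import xml.etree.ElementTree as ET

# IRS e-file documents declare this default namespace.
NS = {"irs": "http://www.irs.gov/efile"}

# Candidate paths per canonical field, newest naming first.
# ASSUMPTION: these element names are illustrative; verify each one
# against the Master Concordance File before relying on it.
HEADER_PATHS = {
    "ein": ["irs:ReturnHeader/irs:Filer/irs:EIN"],
    "tax_year": ["irs:ReturnHeader/irs:TaxYr", "irs:ReturnHeader/irs:TaxYear"],
    "total_revenue": [
        "irs:ReturnData/irs:IRS990/irs:CYTotalRevenueAmt",
        "irs:ReturnData/irs:IRS990/irs:TotalRevenueCurrentYear",
    ],
}


def first_text(root, paths):
    """Return the stripped text of the first path that matches, else None."""
    for path in paths:
        node = root.find(path, NS)
        if node is not None and node.text:
            return node.text.strip()
    return None


def parse_header(xml_text):
    """Extract canonical header fields from one filing's XML text."""
    root = ET.fromstring(xml_text)
    return {field: first_text(root, paths) for field, paths in HEADER_PATHS.items()}


# Made-up filing using the older element names (placeholder EIN and amount).
SAMPLE_OLD = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYear>2011</TaxYear>
    <Filer><EIN>000000000</EIN></Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990><TotalRevenueCurrentYear>100</TotalRevenueCurrentYear></IRS990>
  </ReturnData>
</Return>"""

print(parse_header(SAMPLE_OLD))
# {'ein': '000000000', 'tax_year': '2011', 'total_revenue': '100'}
```

Keeping the candidate paths in one table means a newly discovered variant is a one-line addition rather than a parser change.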
IPEDS Bulk CSV
Location: https://nces.ed.gov/ipeds/datacenter/ — Complete Data Files
Key survey components:
| Component | What it provides |
|---|---|
| HD (Directory) | UNITID, name, EIN, Carnegie classification, sector, control |
| F1A / F2 (Finance) | Expenses by function — instruction, research, academic support, institutional support (admin), etc. |
| S / SAL (HR) | Staff counts by occupational category, faculty counts, salary outlays |
| EF (Enrollment) | Student headcounts for per-student cost calculations |
Note: Variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.
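One way to avoid hardcoding variable names is to look them up in each year's data dictionary by title and rename columns to canonical names at load time. The sketch below assumes the dictionary has already been exported to CSV with `varname` and `vartitle` columns; those column names, the title phrases, the `VAR_A`/`VAR_B` codes, and the UNITID values are all hypothetical placeholders.

```python
import csv
import io

# Canonical column -> phrase expected in the dictionary's variable title.
# ASSUMPTION: phrases are placeholders, not actual IPEDS wording.
CANONICAL_TITLES = {
    "instruction_expenses": "instruction - current year total",
    "institutional_support_expenses": "institutional support - current year total",
}


def build_column_map(dictionary_csv):
    """Map canonical column names to this year's variable codes."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(dictionary_csv)):
        title = row["vartitle"].strip().lower()
        for canonical, phrase in CANONICAL_TITLES.items():
            if phrase in title:
                mapping[canonical] = row["varname"].strip().upper()
    missing = set(CANONICAL_TITLES) - set(mapping)
    if missing:
        # Fail loudly so a renamed variable is noticed instead of loaded as NULL.
        raise KeyError(f"dictionary has no match for: {sorted(missing)}")
    return mapping


def load_rows(data_csv, column_map, unitid):
    """Yield canonical-named rows for a single UNITID only."""
    for row in csv.DictReader(io.StringIO(data_csv)):
        if row["UNITID"].strip() == str(unitid):
            yield {canon: row[code] for canon, code in column_map.items()}


# Made-up dictionary and data for one survey year.
DICT_CSV = (
    "varname,vartitle\n"
    "VAR_A,Instruction - Current year total\n"
    "VAR_B,Institutional support - Current year total\n"
)
DATA_CSV = "UNITID,VAR_A,VAR_B\n111111,10,20\n222222,30,40\n"

column_map = build_column_map(DICT_CSV)
print(list(load_rows(DATA_CSV, column_map, 111111)))
# [{'instruction_expenses': '10', 'institutional_support_expenses': '20'}]
```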
BLS CPI-U
Series ID: CUUR0000SA0 — CPI-U, Not Seasonally Adjusted, U.S. City Average, All Items
Preferred method: Flat-file download from https://download.bls.gov/pub/time.series/cu/cu.data.0.Current (simpler than API, no rate limits, full history in one file).
Alternative: BLS API v2 at https://api.bls.gov/publicAPI/v2/timeseries/data/ (requires free registration key).
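Parsing the flat file might look like the sketch below. It assumes the file is tab-delimited with a header row and the columns `series_id`, `year`, `period`, `value`, `footnote_codes`, with whitespace padding inside fields. Downloading is left out, and the sample values are made up rather than real CPI figures.

```python
SERIES_ID = "CUUR0000SA0"


def parse_cpi(text, series_id=SERIES_ID):
    """Return (year, month, value) tuples for the monthly rows of one series."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 4 or fields[0] != series_id:
            continue
        year, period, value = fields[1], fields[2], fields[3]
        # M01..M12 are months; M13 (annual average) and non-monthly
        # periods are skipped in this sketch.
        if not period.startswith("M") or period == "M13":
            continue
        rows.append((int(year), int(period[1:]), float(value)))
    return rows


# Made-up sample in the assumed layout (values are placeholders).
SAMPLE = (
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "CUUR0000SA0      \t2020\tM01\t     100.000\t\n"
    "CUUR0000SA0      \t2020\tM02\t     101.500\t\n"
    "CUUR0000SA0      \t2020\tM13\t     100.750\t\n"
    "CUUR0000SA0L1E   \t2020\tM01\t     200.000\t\n"
)

print(parse_cpi(SAMPLE))
# [(2020, 1, 100.0), (2020, 2, 101.5)]
```

The tuples line up with the `year`, `month`, `value` columns of `raw_cpi_u`; `series_id` is constant and can be added at insert time.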
Stretch: Admin Office Web Pages
First iteration targets University of Delaware only. Curate URLs for UD's Office of the President, Provost, VP units, and college admin pages. Use requests + BeautifulSoup to extract staff directory headcounts. Expand to 20-30 peer institutions in a later iteration.
Sprint Plan
Sprint 1 (Weeks 1-2): Foundation + IPEDS
IPEDS goes first because it provides UD's UNITID and institutional context. Downloads fetch the full national files; only rows matching UD's UNITID are loaded.
| Task | Description |
|---|---|
| 1.1 | Project scaffolding: pyproject.toml, directory structure, DuckDB setup, CLI skeleton |
| 1.2 | IPEDS Downloader: download HD complete data files for 2005-2024, parse data dictionaries |
| 1.3 | IPEDS Institution Loader: parse HD files, filter to UD's UNITID, load raw_institution |
| 1.4 | IPEDS Finance Parser: download and parse F1A (GASB — UD is public) finance files for UD, map year-varying variable names to canonical columns |
| 1.5 | IPEDS HR Parser: download and parse Fall Staff / Salaries survey files for UD |
| 1.6 | IPEDS Enrollment: parse EF for UD total headcount |
| 1.7 | Tests for all IPEDS parsers using fixture files |
Sprint 2 (Weeks 3-4): IRS 990
| Task | Description |
|---|---|
| 2.1 | Identify UD-related EINs: University of Delaware Foundation and any other UD-affiliated nonprofit entities that file 990s |
| 2.2 | 990 Downloader: download yearly index CSVs, filter to UD-related EINs only, download matching XML files |
| 2.3 | 990 XML Parser — Filing Header: extract EIN, tax year, org name, return type, financials |
| 2.4 | 990 XML Parser — Part VII: extract per-person compensation summaries, handle XPath variations |
| 2.5 | 990 XML Parser — Schedule J: extract detailed compensation breakdowns (base, bonus, other, deferred, nontaxable, total) |
| 2.6 | Title Normalization: regex patterns + manual mapping to bucket titles into canonical roles (President, Provost, VP, Dean, CFO, etc.) |
| 2.7 | Tests with fixture XML files covering different schema years |
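Task 2.6 could start from a small ordered rule table like the one below. This is a sketch under assumptions: the patterns are illustrative, and the policy of returning no role when several rules match (so the title goes to manual review) is one possible reading of "flag ambiguous titles", not a settled design.

```python
import re

# ASSUMPTION: patterns are illustrative starting points, not a vetted list.
TITLE_RULES = [
    ("CFO", re.compile(r"\bCFO\b|chief financial officer", re.I)),
    ("VP", re.compile(r"\b[SE]?V\.?P\b|vice[ -]president", re.I)),
    ("Provost", re.compile(r"\bprovost\b", re.I)),
    # The lookbehind keeps "Vice President" from also counting as President.
    ("President", re.compile(r"(?<!vice[ -])\bpresident\b", re.I)),
    ("Dean", re.compile(r"\bdean\b", re.I)),
]


def normalize_title(raw):
    """Return (role, needs_review).

    role is set only when exactly one rule matches. Titles matching
    several rules are flagged for manual review; titles matching none
    are returned as (None, False).
    """
    matches = [role for role, pattern in TITLE_RULES if pattern.search(raw)]
    if len(matches) == 1:
        return matches[0], False
    return None, len(matches) > 1


for title in ["PRESIDENT", "Vice President for Finance",
              "Executive VP & Provost", "Trustee"]:
    print(title, "->", normalize_title(title))
```

Combined titles such as "Executive VP & Provost" are deliberately left unresolved here; a manual mapping table would decide which canonical role applies.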
Sprint 3 (Week 5): BLS CPI-U + Integration
| Task | Description |
|---|---|
| 3.1 | BLS CPI-U Fetcher: flat-file download of cu.data.0.Current, load raw_cpi_u |
| 3.2 | CLI wiring: ingest ipeds, ingest irs990, ingest cpi, ingest all with --year-range and --force-redownload flags |
| 3.3 | Data validation queries: row counts, NULL rates, year coverage, cross-source consistency |
| 3.4 | Data dictionary documentation |
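The command layout in task 3.2 could be wired roughly as follows. The plan names typer or click for the CLI; this sketch uses the standard library's `argparse` as a dependency-free stand-in. The `START-END` format for `--year-range` and the 2005-2024 default (borrowed from task 1.2) are assumptions.

```python
import argparse

SOURCES = ("ipeds", "irs990", "cpi")


def parse_year_range(text):
    """Parse 'START-END' into an inclusive (start, end) tuple."""
    start, sep, end = text.partition("-")
    if not sep or not (start.isdigit() and end.isdigit()) or int(start) > int(end):
        raise argparse.ArgumentTypeError(f"expected START-END, got {text!r}")
    return int(start), int(end)


def build_parser():
    parser = argparse.ArgumentParser(description="Phase 1 data acquisition")
    commands = parser.add_subparsers(dest="command", required=True)
    ingest = commands.add_parser("ingest", help="download and load raw data")
    ingest.add_argument("source", choices=SOURCES + ("all",))
    # ASSUMPTION: default range mirrors the 2005-2024 span in task 1.2.
    ingest.add_argument("--year-range", type=parse_year_range, default=(2005, 2024))
    ingest.add_argument("--force-redownload", action="store_true")
    return parser


def sources_to_run(args):
    """Expand 'all' into the individual sources, in ingestion order."""
    return list(SOURCES) if args.source == "all" else [args.source]


args = build_parser().parse_args(["ingest", "all", "--year-range", "2010-2012"])
print(sources_to_run(args), args.year_range, args.force_redownload)
# ['ipeds', 'irs990', 'cpi'] (2010, 2012) False
```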
Sprint 4 (Week 6): Stretch — Admin Page Scraper
| Task | Description |
|---|---|
| 4.1 | Curate UD admin office URLs: Office of the President, Provost, VP units, college dean pages |
| 4.2 | Scraper prototype: requests + BeautifulSoup for UD staff directory pages |
| 4.3 | Headcount extraction: parse pages for staff names/counts, load raw_admin_headcount |
| 4.4 | Document limitations and accuracy |
Dependency Graph
Sprint 1: IPEDS ──────────────────┐
(provides UNITID-EIN crosswalk) │
▼
Sprint 2: IRS 990 ────────────────┐
(uses EIN crosswalk to filter) │
│
Sprint 3: BLS CPI-U (independent) │
+ CLI + Validation ─────────────┤
▼
Sprint 4: Stretch scraper ────── Phase 1 Complete
Technical Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Database | DuckDB | Zero-config, fast analytical queries, native CSV/Parquet, single-file portable. Migrate to PostgreSQL in Phase 3. |
| Package manager | uv | Fast, modern, handles virtualenvs and lockfiles. |
| 990 XML parsing | Evaluate IRSx first; fall back to custom lxml + Master Concordance File | IRSx handles schema versioning. If it is unmaintained or incomplete for Schedule J, build a custom parser. |
| IPEDS variable mapping | Map to canonical names at ingest | Keeps DB schema stable across years. Raw files stay on disk. |
| 990 download strategy | Filter to UD Foundation EIN(s) from index files | First iteration is a single institution; broader filtering deferred. |
| BLS data | Flat-file download | Simpler than API, no rate limits, single file covers full history. |
| Admin scraper | requests + BeautifulSoup | Scrapy is overkill for a single institution now, and for the 20-30 targeted sites planned later. |
Database Schema
raw_institution (from IPEDS HD)
unitid,year(composite PK),ein,institution_name,city,state,sector,control,carnegie_class,enrollment_total
raw_990_filing (IRS 990 header)
object_id(PK),ein,tax_year,organization_name,return_type,filing_date,total_revenue,total_expenses,total_assets
raw_990_schedule_j (one row per person per filing)
id(PK),object_id(FK),ein,tax_year,person_name,title,base_compensation,bonus_compensation,other_compensation,deferred_compensation,nontaxable_benefits,total_compensation,compensation_from_related
raw_990_part_vii (one row per person per filing)
id(PK),object_id(FK),ein,tax_year,person_name,title,avg_hours_per_week,reportable_comp_from_org,reportable_comp_from_related,other_compensation
raw_ipeds_finance (one row per institution per year)
unitid,year(composite PK),reporting_standard,total_expenses,instruction_expenses,research_expenses,public_service_expenses,academic_support_expenses,student_services_expenses,institutional_support_expenses,auxiliary_expenses,hospital_expenses,other_expenses,salaries_wages,benefits
raw_ipeds_staff
unitid,year(composite PK),total_staff,faculty_total,management_total
raw_cpi_u
year,month(composite PK),value,series_id
raw_admin_headcount (stretch)
id(PK),unitid,institution_name,admin_unit,page_url,scrape_date,staff_count,staff_names
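As an illustration of a composite-key table and the kind of coverage and NULL-rate checks planned for validation, here is a sketch for `raw_cpi_u`. It uses the standard library's `sqlite3` as a stand-in for DuckDB so it runs without dependencies. The column types are assumptions (the schema above lists names only), and the inserted values are placeholders.

```python
import sqlite3

# ASSUMPTION: column types are guesses; only the names come from the schema.
DDL = """
CREATE TABLE raw_cpi_u (
    year      INTEGER NOT NULL,
    month     INTEGER NOT NULL,
    value     REAL,
    series_id TEXT NOT NULL,
    PRIMARY KEY (year, month)
)
"""

# Years with fewer than 12 monthly rows loaded.
COVERAGE_SQL = """
SELECT year, COUNT(*) AS months_loaded
FROM raw_cpi_u
GROUP BY year
HAVING COUNT(*) < 12
ORDER BY year
"""

# Fraction of rows whose value is NULL.
NULL_RATE_SQL = """
SELECT AVG(CASE WHEN value IS NULL THEN 1.0 ELSE 0.0 END)
FROM raw_cpi_u
"""


def incomplete_years(conn):
    """Return (year, months_loaded) for years missing at least one month."""
    return conn.execute(COVERAGE_SQL).fetchall()


def null_rate(conn):
    """Return the share of raw_cpi_u rows with a NULL value."""
    return conn.execute(NULL_RATE_SQL).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
# Placeholder data: one full year, one partial year containing a NULL.
rows = [(2020, m, 100.0, "CUUR0000SA0") for m in range(1, 13)]
rows += [(2021, m, None if m == 3 else 100.0, "CUUR0000SA0") for m in range(1, 4)]
conn.executemany("INSERT INTO raw_cpi_u VALUES (?, ?, ?, ?)", rows)

print(incomplete_years(conn))
# [(2021, 3)]
```

The queries are plain SQL and would likely carry over to DuckDB with little or no change, though that should be confirmed against the real database.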
Key Libraries
- lxml: XML parsing for 990 filings
- duckdb: database engine
- httpx or requests: HTTP downloads and BLS API
- polars or pandas: CSV processing for IPEDS
- typer or click: CLI framework
- beautifulsoup4: stretch goal scraper
- irsx: evaluate for 990 XML parsing
- pytest: testing
Risks
| Risk | Mitigation |
|---|---|
| IRS XML schema variations break parsing | Use Master Concordance File or IRSx. Test fixtures from multiple schema years. |
| IPEDS variable names change year to year | Always parse the data dictionary alongside each file. |
| Bulk 990 download too large | First iteration is UD only — minimal download. Design for broader filtering in later iteration. |
| IRSx library unmaintained | Evaluate early in Sprint 2. Fallback: custom lxml parser with concordance mappings. |
| Title normalization unreliable | Start with high-confidence exact/regex matches. Flag ambiguous titles for manual review. |
References
- IRS Form 990 Series Downloads
- Master Concordance File
- IRSx 990-xml-reader
- Form 990 XML Schema Mapper
- Schedule J Instructions
- IPEDS Data Center
- Urban Institute IPEDS Scraper
- BLS API v2
- BLS CPI Series IDs
Generated by Claude · Administrative Analytics Project