AdminAnalytics/phase1_plan.md
2026-03-30 19:29:33 -04:00

# Phase 1 — Data Acquisition: Implementation Plan
**Version:** 0.1 | **Status:** Draft | **Date:** March 2026

---
## Summary
Phase 1 builds the data ingestion layer for three primary sources (IRS 990 XML, IPEDS CSV, BLS CPI-U) and a stretch-goal web scraper, scoped to the **University of Delaware** only. Peer/AAU/multi-institution comparisons are deferred to a later iteration. The deliverable is a populated DuckDB database with UD's raw data ready for Phase 2 normalization.

---
## Data Source Details
### IRS 990 Bulk XML
**Location:** `https://apps.irs.gov/pub/epostcard/990/xml/{YEAR}/`

**Key extractions:**
- Filing header: EIN, tax year, org name, total revenue/expenses/assets
- Part VII: per-person compensation summary (name, title, hours, reportable comp)
- **Schedule J**: detailed compensation breakdown per person (base, bonus, other, deferred, nontaxable benefits, total)

**Schema considerations:**
- XML element names vary across tax years. The [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/) maps field variations.
- [IRSx (990-xml-reader)](https://github.com/jsfenfen/990-xml-reader) handles versioned XPath differences. Evaluate it first; fall back to a custom `lxml` parser if it proves unmaintained or incomplete.
- Only private/nonprofit universities file 990s. UD is a public university and does **not** file a 990. However, the University of Delaware Foundation (a separate nonprofit) does file a 990 — this is the source for UD executive compensation and philanthropic data.
- For the first iteration, filter to UD/UD Foundation EINs only. Broader higher-ed filtering (NTEE codes B40-B43, EIN crosswalk) is deferred to the multi-institution iteration.
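
As a rough illustration of the custom-parser fallback, the sketch below pulls Schedule J rows out of a heavily trimmed filing. It uses the standard-library `xml.etree.ElementTree` as a stand-in for `lxml` so that it runs without dependencies. The element names are assumed from one recent schema version and would need checking against the concordance for other years; the sample XML, `FIELD_MAP`, and `parse_schedule_j` are hypothetical names for illustration.

```python
import xml.etree.ElementTree as ET

# IRS e-file returns declare this default XML namespace.
NS = {"irs": "http://www.irs.gov/efile"}

# Canonical column -> element name. ASSUMPTION: names follow one recent schema
# version; older years may differ (see the Master Concordance File).
FIELD_MAP = {
    "base_compensation": "BaseCompensationFilingOrgAmt",
    "bonus_compensation": "BonusFilingOrganizationAmount",
    "other_compensation": "OtherCompensationFilingOrgAmt",
    "deferred_compensation": "DeferredCompensationFlngOrgAmt",
    "nontaxable_benefits": "NontaxableBenefitsFilingOrgAmt",
    "total_compensation": "TotalCompensationFilingOrgAmt",
}

# Hypothetical filing with placeholder values, not real data.
SAMPLE_XML = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYr>2022</TaxYr>
    <Filer><EIN>000000000</EIN></Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990ScheduleJ>
      <RltdOrgOfficerTrstKeyEmplGrp>
        <PersonNm>Jane Example</PersonNm>
        <TitleTxt>President</TitleTxt>
        <BaseCompensationFilingOrgAmt>100</BaseCompensationFilingOrgAmt>
        <BonusFilingOrganizationAmount>20</BonusFilingOrganizationAmount>
        <TotalCompensationFilingOrgAmt>120</TotalCompensationFilingOrgAmt>
      </RltdOrgOfficerTrstKeyEmplGrp>
    </IRS990ScheduleJ>
  </ReturnData>
</Return>
"""


def _amount(node, tag):
    """Return the integer amount under `tag`, or None if the element is absent."""
    text = node.findtext(f"irs:{tag}", default=None, namespaces=NS)
    return int(text) if text is not None else None


def parse_schedule_j(xml_text):
    """Return one dict per person listed on Schedule J of a single filing."""
    root = ET.fromstring(xml_text)
    ein = root.findtext("irs:ReturnHeader/irs:Filer/irs:EIN", namespaces=NS)
    tax_year = root.findtext("irs:ReturnHeader/irs:TaxYr", namespaces=NS)
    rows = []
    person_path = ".//irs:IRS990ScheduleJ/irs:RltdOrgOfficerTrstKeyEmplGrp"
    for person in root.iterfind(person_path, NS):
        row = {
            "ein": ein,
            "tax_year": int(tax_year) if tax_year else None,
            "person_name": person.findtext("irs:PersonNm", namespaces=NS),
            "title": person.findtext("irs:TitleTxt", namespaces=NS),
        }
        for column, tag in FIELD_MAP.items():
            row[column] = _amount(person, tag)
        rows.append(row)
    return rows
```

Missing elements come back as `None` rather than zero, which keeps "not reported" distinguishable from "reported as 0" when NULL rates are checked later.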
### IPEDS Bulk CSV
**Location:** `https://nces.ed.gov/ipeds/datacenter/` (Complete Data Files)

**Key survey components:**
| Component | What it provides |
|-----------|-----------------|
| HD (Directory) | UNITID, name, EIN, Carnegie classification, sector, control |
| F1A / F2 (Finance) | Expenses by function — instruction, research, academic support, **institutional support (admin)**, etc. |
| S / SAL (HR) | Staff counts by occupational category, faculty counts, salary outlays |
| EF (Enrollment) | Student headcounts for per-student cost calculations |

**Note:** Variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.
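
One way to avoid hardcoded variable names is to resolve each canonical column by matching the variable titles in that year's data dictionary. The sketch below is a minimal version of that idea under several assumptions: the dictionary has been exported to CSV with `varName` and `varTitle` columns, the title patterns are adequate, and the variable codes (`X001`, `X002`) and values are placeholders rather than real IPEDS names.

```python
import csv
import io
import re

# Canonical column -> pattern matched against the dictionary's variable title.
# ASSUMPTION: these patterns are illustrative and would need tuning per survey.
CANONICAL_PATTERNS = {
    "instruction_expenses": re.compile(r"^instruction\b.*total", re.I),
    "institutional_support_expenses": re.compile(
        r"^institutional support\b.*total", re.I
    ),
}

# Hypothetical dictionary and data extracts with placeholder codes and values.
DICTIONARY_CSV = """varName,varTitle
UNITID,Unique identification number of the institution
X001,Instruction - Current year total
X002,Institutional support - Current year total
"""

DATA_CSV = """UNITID,X001,X002
999999,1000,250
"""


def build_variable_map(dictionary_rows):
    """Map each canonical column to the source variable name for one year."""
    mapping = {}
    for row in dictionary_rows:
        for canonical, pattern in CANONICAL_PATTERNS.items():
            if pattern.search(row["varTitle"]):
                if canonical in mapping:
                    # Two variables matched: refuse to guess.
                    raise ValueError(f"ambiguous match for {canonical}")
                mapping[canonical] = row["varName"]
    missing = set(CANONICAL_PATTERNS) - set(mapping)
    if missing:
        raise KeyError(f"no dictionary entry for: {sorted(missing)}")
    return mapping


def to_canonical(data_row, variable_map):
    """Project one raw data row onto the canonical column names."""
    return {name: data_row[source] for name, source in variable_map.items()}


dictionary_rows = list(csv.DictReader(io.StringIO(DICTIONARY_CSV)))
variable_map = build_variable_map(dictionary_rows)
canonical_rows = [
    to_canonical(row, variable_map)
    for row in csv.DictReader(io.StringIO(DATA_CSV))
]
```

Failing loudly on ambiguous or missing matches is a deliberate choice here: a silently wrong mapping would be harder to detect than a failed ingest.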
### BLS CPI-U
**Series ID:** `CUUR0000SA0` (CPI-U, Not Seasonally Adjusted, U.S. City Average, All Items)

**Preferred method:** Flat-file download from `https://download.bls.gov/pub/time.series/cu/cu.data.0.Current` (simpler than API, no rate limits, full history in one file).

**Alternative:** BLS API v2 at `https://api.bls.gov/publicAPI/v2/timeseries/data/` (requires free registration key).
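
The flat file is tab-delimited with space-padded fields, so parsing it is mostly a matter of splitting, stripping, and filtering to the target series. The sketch below assumes the column order `series_id`, `year`, `period`, `value`, `footnote_codes` and that monthly observations use periods `M01` to `M12` (with `M13` as the annual average); the sample values are placeholders, not real index levels, and `parse_cpi_u` is a hypothetical helper.

```python
# Placeholder sample in the assumed layout; values are NOT real CPI levels.
SAMPLE_FLAT_FILE = (
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "CUUR0000SA0      \t2020\tM01\t     100.000\t\n"
    "CUUR0000SA0      \t2020\tM13\t     101.000\t\n"
    "CUUR0000SA1      \t2020\tM01\t     200.000\t\n"
)


def parse_cpi_u(text, series_id="CUUR0000SA0"):
    """Return monthly rows for one series, shaped like the raw_cpi_u table."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 4 or fields[0] != series_id:
            continue
        period = fields[2]
        # Keep monthly periods only; M13 (annual average) and any
        # non-monthly period codes are skipped.
        if not period.startswith("M") or period == "M13":
            continue
        rows.append(
            {
                "series_id": fields[0],
                "year": int(fields[1]),
                "month": int(period[1:]),
                "value": float(fields[3]),
            }
        )
    return rows
```

Stripping each field matters because the series ID and value columns are padded; an exact comparison against the unstripped text would silently match nothing.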
### Stretch: Admin Office Web Pages
First iteration targets **University of Delaware only**. Curate URLs for UD's Office of the President, Provost, VP units, and college admin pages. Use `requests` + `BeautifulSoup` to extract staff directory headcounts. Expand to 20-30 peer institutions in a later iteration.

---
## Sprint Plan
### Sprint 1 (Weeks 1-2): Foundation + IPEDS
IPEDS goes first because it provides UD's UNITID and institutional context. Downloads pull the full files, but loading is filtered to **UD's UNITID only**.
| Task | Description |
|------|-------------|
| 1.1 | Project scaffolding: `pyproject.toml`, directory structure, DuckDB setup, CLI skeleton |
| 1.2 | IPEDS Downloader: download HD complete data files for 2005-2024, parse data dictionaries |
| 1.3 | IPEDS Institution Loader: parse HD files, filter to UD's UNITID, load `raw_institution` |
| 1.4 | IPEDS Finance Parser: download and parse F1A (GASB — UD is public) finance files for UD, map year-varying variable names to canonical columns |
| 1.5 | IPEDS HR Parser: download and parse Fall Staff / Salaries survey files for UD |
| 1.6 | IPEDS Enrollment: parse EF for UD total headcount |
| 1.7 | Tests for all IPEDS parsers using fixture files |
### Sprint 2 (Weeks 3-4): IRS 990
| Task | Description |
|------|-------------|
| 2.1 | Identify UD-related EINs: University of Delaware Foundation and any other UD-affiliated nonprofit entities that file 990s |
| 2.2 | 990 Downloader: download yearly index CSVs, filter to UD-related EINs only, download matching XML files |
| 2.3 | 990 XML Parser — Filing Header: extract EIN, tax year, org name, return type, financials |
| 2.4 | 990 XML Parser — Part VII: extract per-person compensation summaries, handle XPath variations |
| 2.5 | 990 XML Parser — Schedule J: extract detailed compensation breakdowns (base, bonus, other, deferred, nontaxable, total) |
| 2.6 | Title Normalization: regex patterns + manual mapping to bucket titles into canonical roles (President, Provost, VP, Dean, CFO, etc.) |
| 2.7 | Tests with fixture XML files covering different schema years |
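
Task 2.6 could start from an ordered list of regex patterns where the first match wins and anything unmatched is left for manual review. The sketch below is one possible shape for that; the pattern list, the precedence order (for example, treating "Executive Vice President & Provost" as Provost), and the `normalize_title` helper are assumptions for illustration, not settled rules.

```python
import re

# Ordered (role, pattern) pairs; the first match wins.
# ASSUMPTION: this precedence is illustrative. More specific roles come first
# so that "Vice President" is not bucketed as "President".
ROLE_PATTERNS = [
    ("CFO", re.compile(r"\b(cfo|chief financial officer)\b", re.I)),
    ("Provost", re.compile(r"\bprovost\b", re.I)),
    ("VP", re.compile(r"\b(vp|vice[\s-]+president)\b", re.I)),
    ("President", re.compile(r"\bpresident\b", re.I)),
    ("Dean", re.compile(r"\bdean\b", re.I)),
]


def normalize_title(raw_title):
    """Return a canonical role, or None when the title needs manual review."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(raw_title):
            return role
    return None
```

Titles such as "Vice Provost" or "Associate Dean" would still land in the senior bucket under these patterns, so a real mapping would likely need a modifier check or a manual override table on top.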
### Sprint 3 (Week 5): BLS CPI-U + Integration
| Task | Description |
|------|-------------|
| 3.1 | BLS CPI-U Fetcher: flat-file download of `cu.data.0.Current`, load `raw_cpi_u` |
| 3.2 | CLI wiring: `ingest ipeds`, `ingest irs990`, `ingest cpi`, `ingest all` with `--year-range` and `--force-redownload` flags |
| 3.3 | Data validation queries: row counts, NULL rates, year coverage, cross-source consistency |
| 3.4 | Data dictionary documentation |
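
The CLI surface in task 3.2 might look roughly like the following. This sketch uses the standard-library `argparse` as a stand-in for `typer` or `click` so that it runs without dependencies. The `START-END` format for `--year-range` and the helper names are assumptions; only the subcommand and flag names come from the plan.

```python
import argparse


def parse_year_range(text):
    """Parse 'START-END' (an assumed format) into an inclusive range of years."""
    start_text, separator, end_text = text.partition("-")
    if not separator:
        raise argparse.ArgumentTypeError("expected START-END, e.g. 2005-2024")
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise argparse.ArgumentTypeError("years must be integers")
    if start > end:
        raise argparse.ArgumentTypeError("START must not exceed END")
    return range(start, end + 1)


def build_parser():
    """Build the `ingest` command-line parser."""
    parser = argparse.ArgumentParser(prog="ingest")
    parser.add_argument("source", choices=["ipeds", "irs990", "cpi", "all"])
    parser.add_argument(
        "--year-range", type=parse_year_range, default=None, metavar="START-END"
    )
    parser.add_argument("--force-redownload", action="store_true")
    return parser


def sources_to_run(source):
    """Expand 'all' into the individual ingest steps, in dependency order."""
    return ["ipeds", "irs990", "cpi"] if source == "all" else [source]
```

For example, `build_parser().parse_args(["ipeds", "--year-range", "2005-2024"])` yields a namespace whose `year_range` covers 2005 through 2024 inclusive.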
### Sprint 4 (Week 6): Stretch — Admin Page Scraper
| Task | Description |
|------|-------------|
| 4.1 | Curate UD admin office URLs: Office of the President, Provost, VP units, college dean pages |
| 4.2 | Scraper prototype: `requests` + `BeautifulSoup` for UD staff directory pages |
| 4.3 | Headcount extraction: parse pages for staff names/counts, load `raw_admin_headcount` |
| 4.4 | Document limitations and accuracy |
---
## Dependency Graph
```
Sprint 1: IPEDS ───────────────────────┐
  (provides UNITID-EIN crosswalk)      │
Sprint 2: IRS 990 ─────────────────────┤
  (uses EIN crosswalk to filter)       │
Sprint 3: BLS CPI-U (independent)      │
  + CLI + Validation ──────────────────┤
Sprint 4: Stretch scraper ─────────────┴── Phase 1 Complete
Phase 2: SKIPPED (folded into dashboard queries)
Sprint 5: Dash dashboard ───────────────── Phase 3 Prototype
```
---
## Phase 2 Skip Decision (March 2026)
Phase 2 (data pipeline & normalization) was **skipped** for the initial prototype. All derived metrics — admin cost ratios, CPI adjustments, compensation growth indices — are computed directly in dashboard SQL queries rather than a separate normalized schema.
**Rationale:** With a single institution (UD) and a populated DuckDB, the query layer is sufficient for a local prototype. A proper Phase 2 with dbt transformations and a unified analytical schema should be built before:
- Expanding to multi-institution comparisons (Phase 4)
- Moving to a production React dashboard (Phase 3 full build)
- Adding complex cross-source joins that benefit from materialized views
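
To make the derived metrics concrete, the arithmetic behind them can be sketched in a few lines. The definitions below are assumptions rather than the project's settled formulas: the admin cost ratio is taken as institutional support over total expenses, the CPI adjustment rescales nominal dollars to a chosen base year, and the growth index sets the base year to 100. All function names are hypothetical.

```python
def admin_cost_ratio(institutional_support, total_expenses):
    """Share of total expenses spent on institutional support (admin)."""
    return institutional_support / total_expenses


def to_real_dollars(nominal, cpi_year, cpi_base):
    """Convert nominal dollars to base-year dollars: nominal * CPI_base / CPI_year."""
    return nominal * cpi_base / cpi_year


def growth_index(values_by_year, base_year):
    """Rescale a yearly series so that the base year equals 100."""
    base = values_by_year[base_year]
    return {year: 100 * value / base for year, value in values_by_year.items()}
```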
---
## Technical Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Database | DuckDB | Zero-config, fast analytical queries, native CSV/Parquet, single-file portable. Migrate to PostgreSQL in Phase 3. |
| Package manager | uv | Fast, modern, handles virtualenvs and lockfiles. |
| 990 XML parsing | Evaluate IRSx first; fall back to custom lxml + Master Concordance File | IRSx handles schema versioning. If it is unmaintained or incomplete for Schedule J, build a custom parser. |
| IPEDS variable mapping | Map to canonical names at ingest | Keeps DB schema stable across years. Raw files stay on disk. |
| 990 download strategy | Filter to UD Foundation EIN(s) from index files | First iteration is a single institution; broader filtering deferred. |
| BLS data | Flat-file download | Simpler than API, no rate limits, single file covers full history. |
| Admin scraper | requests + BeautifulSoup | Scrapy is overkill for one institution, and likely still for the 20-30 targeted sites planned later. |
---
## Database Schema
### raw_institution (from IPEDS HD)
- `unitid`, `year` (composite PK, since HD files are loaded for multiple years), `ein`, `institution_name`, `city`, `state`, `sector`, `control`, `carnegie_class`, `enrollment_total`
### raw_990_filing (IRS 990 header)
- `object_id` (PK), `ein`, `tax_year`, `organization_name`, `return_type`, `filing_date`, `total_revenue`, `total_expenses`, `total_assets`
### raw_990_schedule_j (one row per person per filing)
- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `base_compensation`, `bonus_compensation`, `other_compensation`, `deferred_compensation`, `nontaxable_benefits`, `total_compensation`, `compensation_from_related`
### raw_990_part_vii (one row per person per filing)
- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `avg_hours_per_week`, `reportable_comp_from_org`, `reportable_comp_from_related`, `other_compensation`
### raw_ipeds_finance (one row per institution per year)
- `unitid`, `year` (composite PK), `reporting_standard`, `total_expenses`, `instruction_expenses`, `research_expenses`, `public_service_expenses`, `academic_support_expenses`, `student_services_expenses`, `institutional_support_expenses`, `auxiliary_expenses`, `hospital_expenses`, `other_expenses`, `salaries_wages`, `benefits`
### raw_ipeds_staff (one row per institution per year)
- `unitid`, `year` (composite PK), `total_staff`, `faculty_total`, `management_total`
### raw_cpi_u
- `year`, `month` (composite PK), `value`, `series_id`
### raw_admin_headcount (stretch)
- `id` (PK), `unitid`, `institution_name`, `admin_unit`, `page_url`, `scrape_date`, `staff_count`, `staff_names`
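
The validation checks planned for Sprint 3 (row counts, NULL rates, year coverage) could run against these tables along the following lines. This sketch uses the standard-library `sqlite3` purely as a stand-in so that it runs without dependencies; the SQL is plain enough that it should carry over to DuckDB, though that is an assumption to verify. The `validate_cpi` helper and the inserted rows are hypothetical placeholders.

```python
import sqlite3


def validate_cpi(conn):
    """Report row count, NULL rate for `value`, and gaps in year coverage."""
    row_count, null_values, first_year, last_year = conn.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END),
               MIN(year),
               MAX(year)
        FROM raw_cpi_u
        """
    ).fetchone()
    if row_count == 0:
        return {"row_count": 0, "null_rate": None, "missing_years": []}
    present = {
        row[0] for row in conn.execute("SELECT DISTINCT year FROM raw_cpi_u")
    }
    missing = [y for y in range(first_year, last_year + 1) if y not in present]
    return {
        "row_count": row_count,
        "null_rate": null_values / row_count,
        "missing_years": missing,
    }


# In-memory database with the raw_cpi_u columns and placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE raw_cpi_u (
        year INTEGER,
        month INTEGER,
        value REAL,
        series_id TEXT,
        PRIMARY KEY (year, month)
    )
    """
)
conn.executemany(
    "INSERT INTO raw_cpi_u VALUES (?, ?, ?, ?)",
    [
        (2020, 1, 100.0, "CUUR0000SA0"),
        (2020, 2, None, "CUUR0000SA0"),  # deliberately NULL value
        (2022, 1, 102.0, "CUUR0000SA0"),  # 2021 deliberately absent
    ],
)
report = validate_cpi(conn)
```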
---
## Key Libraries
- `lxml` — XML parsing for 990 filings
- `duckdb` — database engine
- `httpx` or `requests` — HTTP downloads and BLS API
- `polars` or `pandas` — CSV processing for IPEDS
- `typer` or `click` — CLI framework
- `beautifulsoup4` — stretch goal scraper
- `irsx` — evaluate for 990 XML parsing
- `pytest` — testing
---
## Risks
| Risk | Mitigation |
|------|-----------|
| IRS XML schema variations break parsing | Use Master Concordance File or IRSx. Test fixtures from multiple schema years. |
| IPEDS variable names change year to year | Always parse the data dictionary alongside each file. |
| Bulk 990 download too large | First iteration is UD only — minimal download. Design for broader filtering in later iteration. |
| IRSx library unmaintained | Evaluate early in Sprint 2. Fallback: custom lxml parser with concordance mappings. |
| Title normalization unreliable | Start with high-confidence exact/regex matches. Flag ambiguous titles for manual review. |
---
## References
- [IRS Form 990 Series Downloads](https://www.irs.gov/charities-non-profits/form-990-series-downloads)
- [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/)
- [IRSx 990-xml-reader](https://github.com/jsfenfen/990-xml-reader)
- [Form 990 XML Schema Mapper](https://github.com/Giving-Tuesday/form-990-xml-mapper)
- [Schedule J Instructions](https://www.irs.gov/instructions/i990sj)
- [IPEDS Data Center](https://nces.ed.gov/ipeds/use-the-data)
- [Urban Institute IPEDS Scraper](https://github.com/UrbanInstitute/ipeds-scraper)
- [BLS API v2](https://www.bls.gov/developers/api_signature_v2.htm)
- [BLS CPI Series IDs](https://www.bls.gov/cpi/factsheets/cpi-series-ids.htm)
---
*Generated by Claude · Administrative Analytics Project*