Initial project planning docs for UD administrative analytics

- Project scope document (v0.1): objectives, data sources, key metrics, phases
- Phase 1 implementation plan: IPEDS, IRS 990, BLS CPI-U acquisition for UD
- CLAUDE.md: project context and conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

commit f037c50736
3 changed files with 390 additions and 0 deletions

phase1_plan.md (new file, 205 lines added)

# Phase 1 — Data Acquisition: Implementation Plan

**Version:** 0.1 | **Status:** Draft | **Date:** March 2026

---

## Summary

Phase 1 builds the data ingestion layer for three primary sources (IRS 990 XML, IPEDS CSV, BLS CPI-U) and a stretch-goal web scraper, scoped to the **University of Delaware** only. Peer/AAU/multi-institution comparisons are deferred to a later iteration. The deliverable is a populated DuckDB database with UD's raw data ready for Phase 2 normalization.

---

## Data Source Details

### IRS 990 Bulk XML

**Location:** `https://apps.irs.gov/pub/epostcard/990/xml/{YEAR}/`

**Key extractions:**
- Filing header: EIN, tax year, org name, total revenue/expenses/assets
- Part VII: per-person compensation summary (name, title, hours, reportable comp)
- **Schedule J**: detailed compensation breakdown per person — base, bonus, other, deferred, nontaxable benefits, total

**Schema considerations:**
- XML element names vary across tax years. The [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/) maps field variations.
- [IRSx (990-xml-reader)](https://github.com/jsfenfen/990-xml-reader) handles versioned XPath differences. Evaluate it first and fall back to a custom lxml parser if it falls short.
- Only private/nonprofit universities file 990s. UD is a public university and does **not** file a 990. However, the University of Delaware Foundation (a separate nonprofit) does file a 990, and that filing is the source for UD executive compensation and philanthropic data.
- For the first iteration, filter to UD/UD Foundation EINs only. Broader higher-ed filtering (NTEE codes B40-B43, EIN crosswalk) is deferred to the multi-institution iteration.

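The version-tolerant extraction described above can be sketched as a list of candidate paths per field, tried in order until one matches. This is a minimal sketch, not the project's parser: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, `HEADER_PATHS` and `extract_header` are hypothetical names, and the element names are illustrative and should be confirmed against the Master Concordance File before use. The sample documents are made up.

```python
import xml.etree.ElementTree as ET

# 990 e-file documents use this default namespace.
NS = {"irs": "http://www.irs.gov/efile"}

# Candidate paths per output field, newer schema spellings first.
# Illustrative assumption: the real list should be generated from the
# Master Concordance File rather than typed by hand.
HEADER_PATHS = {
    "ein": ["irs:ReturnHeader/irs:Filer/irs:EIN"],
    "tax_year": ["irs:ReturnHeader/irs:TaxYr", "irs:ReturnHeader/irs:TaxYear"],
    "organization_name": [
        "irs:ReturnHeader/irs:Filer/irs:BusinessName/irs:BusinessNameLine1Txt",
        "irs:ReturnHeader/irs:Filer/irs:Name/irs:BusinessNameLine1",
    ],
}


def extract_header(xml_text: str) -> dict:
    """Return the first matching value for each field, or None if no path matches."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, paths in HEADER_PATHS.items():
        record[field] = None
        for path in paths:
            value = root.findtext(path, namespaces=NS)
            if value is not None:
                record[field] = value.strip()
                break
    return record


# Two made-up filings that spell the same fields differently.
SAMPLE_OLD = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYear>2011</TaxYear>
    <Filer>
      <EIN>000000000</EIN>
      <Name><BusinessNameLine1>EXAMPLE FOUNDATION</BusinessNameLine1></Name>
    </Filer>
  </ReturnHeader>
</Return>"""

SAMPLE_NEW = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYr>2020</TaxYr>
    <Filer>
      <EIN>000000000</EIN>
      <BusinessName><BusinessNameLine1Txt>EXAMPLE FOUNDATION</BusinessNameLine1Txt></BusinessName>
    </Filer>
  </ReturnHeader>
</Return>"""

if __name__ == "__main__":
    print(extract_header(SAMPLE_OLD))
    print(extract_header(SAMPLE_NEW))
```

Both samples yield the same record shape, which is the property the Sprint 2 fixture tests would want to assert across schema years.
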
### IPEDS Bulk CSV

**Location:** `https://nces.ed.gov/ipeds/datacenter/` — Complete Data Files

**Key survey components:**

| Component | What it provides |
|-----------|-----------------|
| HD (Directory) | UNITID, name, EIN, Carnegie classification, sector, control |
| F1A / F2 (Finance) | Expenses by function — instruction, research, academic support, **institutional support (admin)**, etc. |
| S / SAL (HR) | Staff counts by occupational category, faculty counts, salary outlays |
| EF (Enrollment) | Student headcounts for per-student cost calculations |

**Note:** Variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.

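One way to honor the "never hardcode variable names" rule is to match canonical columns against the variable titles in each year's data dictionary, then use the resulting mapping when reading the data file. The sketch below assumes the dictionary has already been reduced to `(varname, title)` pairs; `CANONICAL_PATTERNS`, `build_mapping`, `load_rows`, the `VAR_*` names, the titles, and the UNITIDs are all hypothetical placeholders.

```python
import csv
import io
import re

# Canonical column -> pattern matched against the dictionary's variable title.
# The patterns are illustrative assumptions, not copied from a real dictionary.
CANONICAL_PATTERNS = {
    "instruction_expenses": re.compile(r"^instruction\b.*total", re.I),
    "institutional_support_expenses": re.compile(r"^institutional support\b.*total", re.I),
}


def build_mapping(dictionary_rows):
    """Map each canonical column to this year's variable name."""
    mapping = {}
    for varname, title in dictionary_rows:
        for canonical, pattern in CANONICAL_PATTERNS.items():
            if pattern.search(title):
                if canonical in mapping:
                    raise ValueError(f"ambiguous match for {canonical}")
                mapping[canonical] = varname.upper()
    missing = set(CANONICAL_PATTERNS) - set(mapping)
    if missing:
        raise ValueError(f"no dictionary match for: {sorted(missing)}")
    return mapping


def load_rows(csv_text, mapping, unitid):
    """Read a data file and keep only the requested institution."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row = {key.strip().upper(): value for key, value in row.items()}
        if row["UNITID"].strip() != str(unitid):
            continue
        record = {"unitid": int(row["UNITID"])}
        for canonical, varname in mapping.items():
            value = (row.get(varname) or "").strip()
            record[canonical] = int(value) if value else None
        records.append(record)
    return records


# Made-up dictionary and data for one survey year.
DICTIONARY = [
    ("VAR_A", "Instruction - Current year total"),
    ("VAR_B", "Institutional support - Current year total"),
    ("VAR_C", "Research - Current year total"),
]
CSV_TEXT = "UNITID,VAR_A,VAR_B,VAR_C\n111111,100,40,60\n222222,500,90,10\n"

if __name__ == "__main__":
    mapping = build_mapping(DICTIONARY)
    print(load_rows(CSV_TEXT, mapping, 111111))
```

Raising on a missing or ambiguous match is a deliberate choice here: a silent miss would load NULLs for a whole year without anyone noticing.
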
### BLS CPI-U

**Series ID:** `CUUR0000SA0` — CPI-U, Not Seasonally Adjusted, U.S. City Average, All Items

**Preferred method:** Flat-file download from `https://download.bls.gov/pub/time.series/cu/cu.data.0.Current` (simpler than API, no rate limits, full history in one file).

**Alternative:** BLS API v2 at `https://api.bls.gov/publicAPI/v2/timeseries/data/` (requires free registration key).

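As a rough illustration of the flat-file route, the sketch below parses text in the shape of `cu.data.0.Current` and keeps monthly rows for the target series. It assumes a tab-separated layout with padded `series_id`, `year`, `period`, `value`, and `footnote_codes` columns, and that period `M13` marks an annual average to be skipped; verify both against the live file. `parse_cpi` is a hypothetical name and the sample values are illustrative.

```python
import csv
import io

SERIES_ID = "CUUR0000SA0"


def parse_cpi(text, series_id=SERIES_ID):
    """Return (year, month, value, series_id) tuples for monthly observations."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [name.strip() for name in next(reader)]
    rows = []
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        record = dict(zip(header, (cell.strip() for cell in raw)))
        if record["series_id"] != series_id:
            continue
        period = record["period"]
        # Keep M01..M12 only; M13 (assumed annual average) and any
        # non-monthly periods are dropped.
        if not period.startswith("M") or period == "M13":
            continue
        rows.append((int(record["year"]), int(period[1:]), float(record["value"]), series_id))
    return rows


# Illustrative sample in the assumed layout (fields are space-padded).
SAMPLE = (
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "CUUR0000SA0      \t2023\tM01\t     299.170\t\n"
    "CUUR0000SA0      \t2023\tM13\t     304.702\t\n"
    "CUUR0000SA1      \t2023\tM01\t     100.000\t\n"
)

if __name__ == "__main__":
    for row in parse_cpi(SAMPLE):
        print(row)
```

The tuple order matches the `raw_cpi_u` columns (`year`, `month`, `value`, `series_id`), so the result can be handed to a bulk insert unchanged.
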
### Stretch: Admin Office Web Pages

First iteration targets **University of Delaware only**. Curate URLs for UD's Office of the President, Provost, VP units, and college admin pages. Use `requests` + `BeautifulSoup` to extract staff directory headcounts. Expand to 20-30 peer institutions in a later iteration.

---

## Sprint Plan

### Sprint 1 (Weeks 1-2): Foundation + IPEDS

IPEDS goes first because it provides UD's UNITID and institutional context. Downloads can pull the full IPEDS files, but loading is filtered to **UD's UNITID only**.

| Task | Description |
|------|-------------|
| 1.1 | Project scaffolding: `pyproject.toml`, directory structure, DuckDB setup, CLI skeleton |
| 1.2 | IPEDS Downloader: download HD complete data files for 2005-2024, parse data dictionaries |
| 1.3 | IPEDS Institution Loader: parse HD files, filter to UD's UNITID, load `raw_institution` |
| 1.4 | IPEDS Finance Parser: download and parse F1A (GASB — UD is public) finance files for UD, map year-varying variable names to canonical columns |
| 1.5 | IPEDS HR Parser: download and parse Fall Staff / Salaries survey files for UD |
| 1.6 | IPEDS Enrollment: parse EF for UD total headcount |
| 1.7 | Tests for all IPEDS parsers using fixture files |

### Sprint 2 (Weeks 3-4): IRS 990

| Task | Description |
|------|-------------|
| 2.1 | Identify UD-related EINs: University of Delaware Foundation and any other UD-affiliated nonprofit entities that file 990s |
| 2.2 | 990 Downloader: download yearly index CSVs, filter to UD-related EINs only, download matching XML files |
| 2.3 | 990 XML Parser — Filing Header: extract EIN, tax year, org name, return type, financials |
| 2.4 | 990 XML Parser — Part VII: extract per-person compensation summaries, handle XPath variations |
| 2.5 | 990 XML Parser — Schedule J: extract detailed compensation breakdowns (base, bonus, other, deferred, nontaxable, total) |
| 2.6 | Title Normalization: regex patterns + manual mapping to bucket titles into canonical roles (President, Provost, VP, Dean, CFO, etc.) |
| 2.7 | Tests with fixture XML files covering different schema years |

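Task 2.6 could start from something like the sketch below: a manual override table checked first, then one regex per canonical role, with any title that matches zero or several roles flagged for review instead of guessed. The patterns, the `MANUAL_OVERRIDES` entry, and the function name `normalize_title` are hypothetical and would need tuning against real filings.

```python
import re

# Exact-match overrides for titles the patterns cannot handle.
# The entry below is a hypothetical example.
MANUAL_OVERRIDES = {"CHIEF EXECUTIVE": "President"}

# One pattern per canonical role, applied to the upper-cased title.
TITLE_PATTERNS = [
    ("CFO", re.compile(r"\bCFO\b|CHIEF FINANCIAL OFFICER")),
    ("Provost", re.compile(r"\bPROVOST\b")),
    ("VP", re.compile(r"\bV\.?P\.?\b|VICE PRES")),
    # The lookbehind keeps "VICE PRESIDENT" out of the President bucket.
    ("President", re.compile(r"(?<!VICE )\bPRESIDENT\b")),
    ("Dean", re.compile(r"\bDEAN\b")),
]


def normalize_title(raw):
    """Return (canonical_role, needs_review) for a raw 990 title string."""
    title = re.sub(r"\s+", " ", raw.strip().upper())
    if title in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[title], False
    matches = [role for role, pattern in TITLE_PATTERNS if pattern.search(title)]
    if len(matches) == 1:
        return matches[0], False
    # Zero or multiple roles matched: leave unassigned and flag it.
    return None, True


if __name__ == "__main__":
    for sample in ["President", "Vice President for Finance",
                   "Executive VP & Provost", "Trustee"]:
        print(sample, "->", normalize_title(sample))
```

Returning a review flag rather than picking the first match follows the mitigation listed under Risks: high-confidence matches are bucketed, and combined titles such as "Executive VP & Provost" go to manual review.
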
### Sprint 3 (Week 5): BLS CPI-U + Integration

| Task | Description |
|------|-------------|
| 3.1 | BLS CPI-U Fetcher: flat-file download of `cu.data.0.Current`, load `raw_cpi_u` |
| 3.2 | CLI wiring: `ingest ipeds`, `ingest irs990`, `ingest cpi`, `ingest all` with `--year-range` and `--force-redownload` flags |
| 3.3 | Data validation queries: row counts, NULL rates, year coverage, cross-source consistency |
| 3.4 | Data dictionary documentation |

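The command surface in task 3.2 might look like the sketch below. It uses the standard library's `argparse` as a stand-in, since the plan has not yet chosen between `typer` and `click`. The program name `udanalytics`, the `START-END` format for `--year-range`, and the 2005-2024 default (borrowed from task 1.2) are assumptions.

```python
import argparse


def parse_year_range(text):
    """Parse 'START-END' (or a single year) into an inclusive (first, last) tuple."""
    start, _, end = text.partition("-")
    first, last = int(start), int(end or start)
    if first > last:
        raise argparse.ArgumentTypeError(f"start year after end year: {text}")
    return first, last


def build_parser():
    parser = argparse.ArgumentParser(prog="udanalytics")  # hypothetical program name
    commands = parser.add_subparsers(dest="command", required=True)

    ingest = commands.add_parser("ingest", help="download and load raw data")
    ingest.add_argument("source", choices=["ipeds", "irs990", "cpi", "all"])
    ingest.add_argument(
        "--year-range",
        type=parse_year_range,
        default=(2005, 2024),  # assumed default, taken from the IPEDS download range
        metavar="START-END",
    )
    ingest.add_argument("--force-redownload", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.source, args.year_range, args.force_redownload)
```

Invoked as `udanalytics ingest ipeds --year-range 2010-2015`, this parses to source `ipeds` and the range `(2010, 2015)`; dispatching to the actual loaders is left out.
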
### Sprint 4 (Week 6): Stretch — Admin Page Scraper

| Task | Description |
|------|-------------|
| 4.1 | Curate UD admin office URLs: Office of the President, Provost, VP units, college dean pages |
| 4.2 | Scraper prototype: `requests` + `BeautifulSoup` for UD staff directory pages |
| 4.3 | Headcount extraction: parse pages for staff names/counts, load `raw_admin_headcount` |
| 4.4 | Document limitations and accuracy |

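For task 4.3, the extraction step might be prototyped as below. This is only a sketch under strong assumptions: real directory pages will each need their own selectors, the `staff-name` CSS class is invented for the example, and the standard library's `html.parser` stands in for `BeautifulSoup` so the snippet has no dependencies. Fetching pages with `requests` is omitted.

```python
from html.parser import HTMLParser


class StaffNameParser(HTMLParser):
    """Collect the text of elements carrying a given CSS class."""

    def __init__(self, name_class):
        super().__init__()
        self.name_class = name_class
        self.names = []
        self._open_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._open_tag is None and self.name_class in classes:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing tag with the same name ends the
        # capture, so same-named nested tags are not handled.
        if tag == self._open_tag:
            name = " ".join("".join(self._buffer).split())
            if name:
                self.names.append(name)
            self._open_tag = None


def extract_headcount(html, name_class="staff-name"):
    """Return de-duplicated staff names and their count for one page."""
    parser = StaffNameParser(name_class)
    parser.feed(html)
    parser.close()
    unique = sorted(set(parser.names))
    return {"staff_count": len(unique), "staff_names": unique}


# Made-up page fragment.
SAMPLE_HTML = """
<ul>
  <li class="staff-name">Jane Doe</li>
  <li class="staff-name featured"> John  Roe </li>
  <li class="office">Room 101</li>
  <li class="staff-name">Jane Doe</li>
</ul>
"""

if __name__ == "__main__":
    print(extract_headcount(SAMPLE_HTML))
```

The returned keys line up with the `staff_count` and `staff_names` columns of `raw_admin_headcount`; de-duplicating by name is itself an assumption that task 4.4 should document.
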
---

## Dependency Graph

```
Sprint 1: IPEDS ──────────────────┐
  (provides UNITID-EIN crosswalk) │
                                  ▼
Sprint 2: IRS 990 ────────────────┐
  (uses EIN crosswalk to filter)  │
                                  │
Sprint 3: BLS CPI-U (independent) │
  + CLI + Validation ─────────────┤
                                  ▼
Sprint 4: Stretch scraper ────── Phase 1 Complete
```

---

## Technical Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Database | DuckDB | Zero-config, fast analytical queries, native CSV/Parquet, single-file portable. Migrate to PostgreSQL in Phase 3. |
| Package manager | uv | Fast, modern, handles virtualenvs and lockfiles. |
| 990 XML parsing | Evaluate IRSx first, fall back to custom lxml + Master Concordance File | IRSx handles schema versioning. If unmaintained or incomplete for Schedule J, build custom. |
| IPEDS variable mapping | Map to canonical names at ingest | Keeps DB schema stable across years. Raw files stay on disk. |
| 990 download strategy | Filter to UD Foundation EIN(s) from index files | First iteration is a single institution; broader filtering deferred. |
| BLS data | Flat-file download | Simpler than API, no rate limits, single file covers full history. |
| Admin scraper | requests + BeautifulSoup | Scrapy is overkill for one institution now and for 20-30 targeted sites later. |

---

## Database Schema

### raw_institution (from IPEDS HD, one row per institution per year)
- `unitid`, `year` (composite PK), `ein`, `institution_name`, `city`, `state`, `sector`, `control`, `carnegie_class`, `enrollment_total`

### raw_990_filing (IRS 990 header)
- `object_id` (PK), `ein`, `tax_year`, `organization_name`, `return_type`, `filing_date`, `total_revenue`, `total_expenses`, `total_assets`

### raw_990_schedule_j (one row per person per filing)
- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `base_compensation`, `bonus_compensation`, `other_compensation`, `deferred_compensation`, `nontaxable_benefits`, `total_compensation`, `compensation_from_related`

### raw_990_part_vii (one row per person per filing)
- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `avg_hours_per_week`, `reportable_comp_from_org`, `reportable_comp_from_related`, `other_compensation`

### raw_ipeds_finance (one row per institution per year)
- `unitid`, `year` (composite PK), `reporting_standard`, `total_expenses`, `instruction_expenses`, `research_expenses`, `public_service_expenses`, `academic_support_expenses`, `student_services_expenses`, `institutional_support_expenses`, `auxiliary_expenses`, `hospital_expenses`, `other_expenses`, `salaries_wages`, `benefits`

### raw_ipeds_staff
- `unitid`, `year` (composite PK), `total_staff`, `faculty_total`, `management_total`

### raw_cpi_u
- `year`, `month` (composite PK), `value`, `series_id`

### raw_admin_headcount (stretch)
- `id` (PK), `unitid`, `institution_name`, `admin_unit`, `page_url`, `scrape_date`, `staff_count`, `staff_names`

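To make the composite keys and the Sprint 3 validation checks concrete, here is a small sketch for `raw_cpi_u`. It runs on the standard library's `sqlite3` as a stand-in for DuckDB so it has no dependencies. The column types, the `validate_cpi` helper, and the sample rows are assumptions, and the SQL has not been run against DuckDB here.

```python
import sqlite3

# Column types are assumptions; the plan lists only column names and keys.
DDL = """
CREATE TABLE raw_cpi_u (
    year      INTEGER NOT NULL,
    month     INTEGER NOT NULL,
    value     REAL,
    series_id TEXT NOT NULL,
    PRIMARY KEY (year, month)
)
"""

# Row count, NULL count, and year coverage in one pass.
VALIDATION_SQL = """
SELECT COUNT(*)                                       AS row_count,
       SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) AS null_values,
       MIN(year)                                      AS first_year,
       MAX(year)                                      AS last_year,
       COUNT(DISTINCT year)                           AS years_covered
FROM raw_cpi_u
"""

COLUMNS = ("row_count", "null_values", "first_year", "last_year", "years_covered")


def validate_cpi(rows):
    """Load rows into an in-memory table and return the validation summary."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(DDL)
        con.executemany("INSERT INTO raw_cpi_u VALUES (?, ?, ?, ?)", rows)
        return dict(zip(COLUMNS, con.execute(VALIDATION_SQL).fetchone()))
    finally:
        con.close()


# Illustrative rows, including one missing value.
SAMPLE_ROWS = [
    (2022, 12, 296.797, "CUUR0000SA0"),
    (2023, 1, 299.170, "CUUR0000SA0"),
    (2023, 2, None, "CUUR0000SA0"),
]

if __name__ == "__main__":
    print(validate_cpi(SAMPLE_ROWS))
```

The composite primary key doubles as a validation rule: loading the same `(year, month)` twice fails at insert time instead of producing silent duplicates.
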
---

## Key Libraries

- `lxml` — XML parsing for 990 filings
- `duckdb` — database engine
- `httpx` or `requests` — HTTP downloads and BLS API
- `polars` or `pandas` — CSV processing for IPEDS
- `typer` or `click` — CLI framework
- `beautifulsoup4` — stretch goal scraper
- `irsx` — evaluate for 990 XML parsing
- `pytest` — testing

---

## Risks

| Risk | Mitigation |
|------|-----------|
| IRS XML schema variations break parsing | Use Master Concordance File or IRSx. Test fixtures from multiple schema years. |
| IPEDS variable names change year to year | Always parse the data dictionary alongside each file. |
| Bulk 990 download too large | First iteration is UD only — minimal download. Design for broader filtering in later iteration. |
| IRSx library unmaintained | Evaluate early in Sprint 2. Fallback: custom lxml parser with concordance mappings. |
| Title normalization unreliable | Start with high-confidence exact/regex matches. Flag ambiguous titles for manual review. |

---

## References

- [IRS Form 990 Series Downloads](https://www.irs.gov/charities-non-profits/form-990-series-downloads)
- [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/)
- [IRSx 990-xml-reader](https://github.com/jsfenfen/990-xml-reader)
- [Form 990 XML Schema Mapper](https://github.com/Giving-Tuesday/form-990-xml-mapper)
- [Schedule J Instructions](https://www.irs.gov/instructions/i990sj)
- [IPEDS Data Center](https://nces.ed.gov/ipeds/use-the-data)
- [Urban Institute IPEDS Scraper](https://github.com/UrbanInstitute/ipeds-scraper)
- [BLS API v2](https://www.bls.gov/developers/api_signature_v2.htm)
- [BLS CPI Series IDs](https://www.bls.gov/cpi/factsheets/cpi-series-ids.htm)

---

*Generated by Claude · Administrative Analytics Project*