Initial project planning docs for UD administrative analytics
- Project scope document (v0.1): objectives, data sources, key metrics, phases
- Phase 1 implementation plan: IPEDS, IRS 990, BLS CPI-U acquisition for UD
- CLAUDE.md: project context and conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit
f037c50736
3 changed files with 390 additions and 0 deletions
67
CLAUDE.md
Normal file
@ -0,0 +1,67 @@
# Admin Analytics

University administrative cost benchmarking project using public data (IRS 990, IPEDS, BLS CPI-U). **First iteration is scoped to the University of Delaware only.** Peer/AAU/multi-institution comparisons are planned for a later iteration.

## Project status

Currently in planning. Phase 1 (Data Acquisition) is planned but not yet built. See `phase1_plan.md` for the full implementation plan and `administrative_analytics_scope_v0.1.md` for project scope.

## Architecture

- **Language:** Python
- **Database:** DuckDB for Phase 1 (single-file, zero-config). Migrate to PostgreSQL in Phase 3 when the dashboard needs concurrent access.
- **Package manager:** uv
- **CLI framework:** typer or click (TBD)
- **Testing:** pytest

## Data sources

| Source | Format | What we extract |
|--------|--------|-----------------|
| IRS 990 bulk XML | XML (versioned schemas) | Filing financials, Part VII compensation, Schedule J detailed compensation |
| IPEDS | CSV bulk downloads | Institution directory (HD), finance by function (F1A/F2), staffing (S/SAL), enrollment (EF) |
| BLS CPI-U | Flat file or API | Consumer Price Index for inflation-adjusted compensation analysis |
| Admin office web pages (stretch) | HTML scraping | Staff directory headcounts |

## Key concepts

- **University of Delaware** is the sole target institution for the first iteration. UD's IPEDS UNITID is the anchor for all IPEDS queries.
- **UD is a public university** and does not file an IRS 990. However, the **University of Delaware Foundation** (a separate nonprofit) does file a 990 — this is the source for executive compensation (Schedule J) and philanthropic data.
- **UNITID** is the canonical institution identifier (from IPEDS). All cross-source linking flows through UNITID.
- **EIN** links to IRS 990 filings. For the first iteration, only UD Foundation EIN(s) are needed. A broader UNITID-to-EIN crosswalk will be built when expanding to peer institutions.
- IRS 990 XML schemas change across tax years. Use the Master Concordance File or IRSx library to handle XPath variations.
- IPEDS variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.

## Planned project structure

```
src/admin_analytics/
  config.py
  cli.py
  db/          # DuckDB schema and connection
  irs990/      # 990 download, XML parsing, Schedule J extraction, university filtering
  ipeds/       # IPEDS download, dictionary parsing, finance/HR/enrollment loading
  bls/         # CPI-U fetcher and loader
  scraper/     # Stretch: admin office headcount scraper
data/raw/      # Downloaded files (gitignored)
tests/
  fixtures/    # Sample XML/CSV files for tests
```

## Build & run

Not yet implemented. When built, the CLI will support:

```
admin-analytics ingest ipeds --year-range 2005-2024
admin-analytics ingest irs990 --year-range 2005-2024
admin-analytics ingest cpi
admin-analytics ingest all
```

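The commands above could be wired up roughly as follows. This is a minimal sketch only: it uses the standard-library `argparse` as a stand-in because the CLI framework (typer or click) is still TBD, and `parse_year_range` and `build_parser` are hypothetical helper names, not part of any planned module.

```python
import argparse


def parse_year_range(text: str) -> range:
    """Turn 'START-END' (e.g. '2005-2024') into an inclusive range of years."""
    start_text, sep, end_text = text.partition("-")
    if not sep:
        raise argparse.ArgumentTypeError(f"expected START-END, got {text!r}")
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"non-numeric year in {text!r}") from None
    if start > end:
        raise argparse.ArgumentTypeError(f"start year after end year in {text!r}")
    return range(start, end + 1)


def build_parser() -> argparse.ArgumentParser:
    """Build an `admin-analytics ingest <source>` parser (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="admin-analytics")
    commands = parser.add_subparsers(dest="command", required=True)
    ingest = commands.add_parser("ingest", help="download and load a data source")
    ingest.add_argument("source", choices=["ipeds", "irs990", "cpi", "all"])
    ingest.add_argument("--year-range", type=parse_year_range, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["ingest", "ipeds", "--year-range", "2005-2024"])
print(args.source, args.year_range[0], args.year_range[-1])
```

Validating the range inside the `type=` callback keeps malformed input such as `2024-2005` out of the ingest code; the same function could back a typer or click option later.
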
## Conventions

- Raw data tables are prefixed with `raw_` (e.g., `raw_institution`, `raw_990_schedule_j`)
- Downloaded files go in `data/raw/` and are gitignored
- IPEDS variables are mapped to canonical column names at ingest time; raw CSVs stay on disk for reprocessing
- First iteration filters all data to UD/UD Foundation only. Design parsers to accept institution filters so they can scale to multi-institution in a later iteration
- 990 downloads are filtered by EIN from index files to avoid downloading the full archive (hundreds of GB)

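The EIN-based filtering convention can be sketched as below. The column names (`EIN`, `OBJECT_ID`, and so on) are an assumption about the index file layout, the rows are made up, and `filter_index_by_ein` is a hypothetical helper; the real UD Foundation EIN is intentionally not filled in.

```python
import csv
import io


def filter_index_by_ein(index_csv, eins):
    """Yield index rows whose EIN is in `eins` (compared as 9-digit strings)."""
    wanted = {str(e).replace("-", "").zfill(9) for e in eins}
    for row in csv.DictReader(index_csv):
        if row["EIN"].strip().zfill(9) in wanted:
            yield row


# Made-up rows; the header is an assumed layout, not a verified one.
SAMPLE_INDEX = io.StringIO(
    "RETURN_ID,EIN,TAX_PERIOD,TAXPAYER_NAME,RETURN_TYPE,OBJECT_ID\n"
    "1,000000001,202306,EXAMPLE FOUNDATION,990,202400000000000001\n"
    "2,000000002,202306,UNRELATED ORG,990EZ,202400000000000002\n"
)

matches = list(filter_index_by_ein(SAMPLE_INDEX, ["00-0000001"]))
print([m["OBJECT_ID"] for m in matches])
```

Because the function accepts any iterable of EINs, the same code path would serve a multi-institution crosswalk later without changes.
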
118
administrative_analytics_scope_v0.1.md
Normal file
@ -0,0 +1,118 @@
# Administrative Analytics — Project Scope

**Version:** 0.1 | **Status:** Draft | **Date:** March 2026

---

## Problem statement

University administrative costs have grown significantly over the past two decades, yet institutions lack easy tools to benchmark their administrative spending against peer institutions or correlate those costs with performance outcomes. This project aims to close that gap using publicly available data.

---

## Objectives

Build a data pipeline and analytics dashboard that aggregates public data on university administrative costs, benchmarks institutions against peers, and surfaces correlations between administrative spend and key performance indicators such as fundraising revenue.

**First iteration scope:** Data acquisition and analysis will focus exclusively on the **University of Delaware**. Peer institution comparisons (AAU members, Carnegie peers, etc.) will be added in a later iteration.

---

## Data sources

### Primary

- **IRS Form 990** — private universities and non-profits
- **IPEDS** (Integrated Postsecondary Education Data System) — all institutions, including public universities
- **NACUBO endowment study reports**

### Secondary

- **BLS CPI-U data** — Consumer Price Index for All Urban Consumers, for inflation-adjusted compensation analysis
- University philanthropy and fundraising reports (public fact books)
- Chronicle of Higher Education data
- Public institutional fact books

### Stretch

- **Institutional administrative office web pages** — scrape Provost, President, VP unit, and college administration pages for staff directories / headcount tracking

---

## Key metrics

### Cost metrics

- Admin cost per student
- Admin-to-faculty ratio
- Administrative spending as % of total expenses

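The three cost metrics reduce to simple ratios. The sketch below assumes that "administrative spending" is proxied by the IPEDS institutional support expense category, which is one possible definition and not a settled one; the function names and all figures are made up for illustration.

```python
def admin_cost_per_student(admin_expenses: float, enrollment: int) -> float:
    """Administrative expenses divided by total student headcount."""
    if enrollment <= 0:
        raise ValueError("enrollment must be positive")
    return admin_expenses / enrollment


def admin_to_faculty_ratio(admin_staff: int, faculty: int) -> float:
    """Number of administrative staff per faculty member."""
    if faculty <= 0:
        raise ValueError("faculty must be positive")
    return admin_staff / faculty


def admin_share_of_expenses(admin_expenses: float, total_expenses: float) -> float:
    """Administrative expenses as a percentage of total expenses."""
    if total_expenses <= 0:
        raise ValueError("total_expenses must be positive")
    return 100.0 * admin_expenses / total_expenses


# Made-up inputs, not real University of Delaware data.
print(admin_cost_per_student(50_000_000, 20_000))
print(admin_to_faculty_ratio(300, 1_200))
print(admin_share_of_expenses(50_000_000, 1_000_000_000))
```
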
### Compensation metrics

- Key employee salaries from IRS 990 Schedule J (President, Provost, VPs, Deans, etc.)
- Year-over-year compensation growth per position
- Compensation growth vs. CPI-U (Bureau of Labor Statistics)

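Comparing compensation growth with CPI-U amounts to deflating nominal dollars by the ratio of index values. A minimal sketch, with hypothetical function names and made-up salary and index figures:

```python
def to_constant_dollars(nominal: float, cpi_in_year: float, cpi_in_base_year: float) -> float:
    """Express a nominal amount in base-year dollars using CPI-U index values."""
    return nominal * cpi_in_base_year / cpi_in_year


def real_growth(nominal_start: float, nominal_end: float,
                cpi_start: float, cpi_end: float) -> float:
    """Growth rate after removing inflation: (nominal ratio / CPI ratio) - 1."""
    nominal_ratio = nominal_end / nominal_start
    inflation_ratio = cpi_end / cpi_start
    return nominal_ratio / inflation_ratio - 1.0


# Made-up example: pay rises 25% while the index rises 20%.
print(to_constant_dollars(400_000, cpi_in_year=200.0, cpi_in_base_year=240.0))
print(round(real_growth(400_000, 500_000, cpi_start=200.0, cpi_end=240.0), 4))
```

Dividing the ratios, and not subtracting the two percentage changes, keeps the result exact when inflation is large; in the example the real growth is about 4.2%, not 5%.
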
### Performance metrics

- Philanthropic revenue raised
- Endowment growth year-over-year
- Grant funding secured

### Benchmarking (later iteration)

- Peer institution comparisons by Carnegie classification, size, and public/private status
- AAU institution comparisons

### Trends

- Year-over-year cost and performance trajectories (5–10 year views)

---

## Phases

### Phase 1 — Data acquisition

Build parsers for IRS 990 filings (including Schedule J key employee compensation) and IPEDS data for the **University of Delaware** only. Ingest BLS CPI-U series for inflation benchmarking. Establish a raw data store. **Stretch:** prototype scraper for UD administrative office web pages to track headcount.

### Phase 2 — Data pipeline & normalization

Clean, normalize, and reconcile data across sources. Define a unified schema. Build institution-matching logic to link records across datasets using IPEDS Unit ID as the canonical identifier.

### Phase 3 — Internal analytics dashboard

Build an internal tool for our institution to explore cost and performance data. Validate findings with stakeholders.

### Phase 4 — Multi-institution expansion

Extend data acquisition to peer institutions (AAU members, Carnegie peers, etc.). Add benchmarking comparisons, configurable peer groups, and export features.

---

## Technical approach

### Data collection

- Python scraping with Scrapy / BeautifulSoup for supplementary sources
- IRS 990 XML bulk data parser (IRS provides annual bulk downloads)
- IPEDS bulk data files (CSV downloads, no scraping required)

### Data storage & transformation

- PostgreSQL or DuckDB as primary data store
- dbt for data transformation and modeling

### Frontend & API

- React with a charting library (e.g., Recharts or Observable Plot)
- REST or GraphQL API layer

---

## Risks & considerations

| Severity | Risk | Notes |
|----------|------|-------|
| Medium | Data completeness | Public universities do not file 990s. IPEDS is the primary fallback and provides expense breakdowns by function. |
| Medium | Institution matching | Names vary across datasets. Use IPEDS Unit ID as the canonical identifier from the start. |
| Low | Rate limiting | IRS and IPEDS data are available as bulk downloads; scraping is mostly not required for core datasets. |
| Low | Data licensing | All target sources are public domain or open government data. No licensing barriers anticipated. |

---

## Out of scope (v1)

- Internal financial systems integration
- Real-time data feeds
- Accreditation or ranking data
- Faculty / HR compensation analysis (key executive compensation from 990s *is* in scope)

---

*Generated by Claude · Administrative Analytics Project*
205
phase1_plan.md
Normal file
@ -0,0 +1,205 @@
# Phase 1 — Data Acquisition: Implementation Plan

**Version:** 0.1 | **Status:** Draft | **Date:** March 2026

---

## Summary

Phase 1 builds the data ingestion layer for three primary sources (IRS 990 XML, IPEDS CSV, BLS CPI-U) and a stretch-goal web scraper, scoped to the **University of Delaware** only. Peer/AAU/multi-institution comparisons are deferred to a later iteration. The deliverable is a populated DuckDB database with UD's raw data ready for Phase 2 normalization.

---

## Data Source Details

### IRS 990 Bulk XML

**Location:** `https://apps.irs.gov/pub/epostcard/990/xml/{YEAR}/`

**Key extractions:**

- Filing header: EIN, tax year, org name, total revenue/expenses/assets
- Part VII: per-person compensation summary (name, title, hours, reportable comp)
- **Schedule J**: detailed compensation breakdown per person — base, bonus, other, deferred, nontaxable benefits, total

**Schema considerations:**

- XML element names vary across tax years. The [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/) maps field variations.
- [IRSx (990-xml-reader)](https://github.com/jsfenfen/990-xml-reader) handles versioned XPath differences. Evaluate it first; fall back to a custom lxml parser if it proves inadequate.
- Only private/nonprofit universities file 990s. UD is a public university and does **not** file a 990. However, the University of Delaware Foundation (a separate nonprofit) does file a 990 — this is the source for UD executive compensation and philanthropic data.
- For the first iteration, filter to UD/UD Foundation EINs only. Broader higher-ed filtering (NTEE codes B40-B43, EIN crosswalk) is deferred to the multi-institution iteration.

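The fallback parser's handling of year-varying element names could look roughly like this. It is a sketch under stated assumptions: it uses the standard-library `xml.etree.ElementTree` in place of lxml, the element names in `FIELD_CANDIDATES` and `GROUP_CANDIDATES` are illustrative and must be checked against the Master Concordance File, and the sample filing is made up.

```python
import xml.etree.ElementTree as ET

NS = {"irs": "http://www.irs.gov/efile"}

# Candidate element names per canonical field, tried in order.
# Illustrative only: verify every name against the concordance before use.
FIELD_CANDIDATES = {
    "person_name": ["PersonNm", "NamePerson"],
    "title": ["TitleTxt", "Title"],
    "base_compensation": ["BaseCompensationFilingOrgAmt", "BaseCompensationFilingOrg"],
}
GROUP_CANDIDATES = ["RltdOrgOfficerTrstKeyEmplGrp", "Form990ScheduleJPartII"]


def first_text(node, candidates):
    """Return the text of the first candidate child element that is present."""
    for name in candidates:
        found = node.find(f"irs:{name}", NS)
        if found is not None and found.text:
            return found.text.strip()
    return None


def parse_schedule_j(xml_text):
    """Extract one dict per person group; values are left as strings."""
    root = ET.fromstring(xml_text)
    rows = []
    for group_name in GROUP_CANDIDATES:
        for node in root.iter(f"{{{NS['irs']}}}{group_name}"):
            rows.append({field: first_text(node, names)
                         for field, names in FIELD_CANDIDATES.items()})
    return rows


# Made-up filing fragment for illustration.
SAMPLE = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS990ScheduleJ>
      <RltdOrgOfficerTrstKeyEmplGrp>
        <PersonNm>Jane Example</PersonNm>
        <TitleTxt>President</TitleTxt>
        <BaseCompensationFilingOrgAmt>100000</BaseCompensationFilingOrgAmt>
      </RltdOrgOfficerTrstKeyEmplGrp>
    </IRS990ScheduleJ>
  </ReturnData>
</Return>"""

rows = parse_schedule_j(SAMPLE)
print(rows)
```

Keeping the name variants in a data table, separate from the traversal logic, means a new schema year should only require new entries, not new code.
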
### IPEDS Bulk CSV

**Location:** `https://nces.ed.gov/ipeds/datacenter/` — Complete Data Files

**Key survey components:**

| Component | What it provides |
|-----------|-----------------|
| HD (Directory) | UNITID, name, EIN, Carnegie classification, sector, control |
| F1A / F2 (Finance) | Expenses by function — instruction, research, academic support, **institutional support (admin)**, etc. |
| S / SAL (HR) | Staff counts by occupational category, faculty counts, salary outlays |
| EF (Enrollment) | Student headcounts for per-student cost calculations |

**Note:** Variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.

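One way to honor the note above is to derive each year's column mapping from the dictionary's variable titles. The sketch below assumes the dictionary has already been read into rows with `varName` and `varTitle` keys; the title phrases, variable names, UNITID, and values are all made up, and `build_column_map` and `load_rows` are hypothetical helpers.

```python
import csv
import io

# Canonical column -> phrase expected in the dictionary title (assumed phrases).
CANONICAL_TITLES = {
    "institutional_support_expenses": "institutional support",
    "instruction_expenses": "instruction",
}


def build_column_map(dictionary_rows):
    """Map this year's variable names to canonical columns via their titles."""
    column_map = {}
    for row in dictionary_rows:
        title = row["varTitle"].lower()
        for canonical, phrase in CANONICAL_TITLES.items():
            if phrase in title:
                column_map[row["varName"]] = canonical
    return column_map


def load_rows(data_csv, column_map, unitid):
    """Yield canonical-named records for a single institution."""
    for row in csv.DictReader(data_csv):
        if row["UNITID"].strip() == str(unitid):
            record = {"unitid": int(row["UNITID"])}
            record.update({canonical: row[var] for var, canonical in column_map.items()})
            yield record


# Made-up dictionary and data for one survey year.
DICTIONARY = [
    {"varName": "FIN_A1", "varTitle": "Instruction - Current year total"},
    {"varName": "FIN_B2", "varTitle": "Institutional support - Current year total"},
]
DATA = io.StringIO("UNITID,FIN_A1,FIN_B2\n111111,500,900\n222222,1,2\n")

column_map = build_column_map(DICTIONARY)
records = list(load_rows(DATA, column_map, unitid=111111))
print(records)
```

Substring matching on titles is fragile (several variables can share a phrase), so a real mapping would likely need stricter rules plus a check that every canonical column resolved to exactly one variable.
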
### BLS CPI-U

**Series ID:** `CUUR0000SA0` — CPI-U, Not Seasonally Adjusted, U.S. City Average, All Items

**Preferred method:** Flat-file download from `https://download.bls.gov/pub/time.series/cu/cu.data.0.Current` (simpler than API, no rate limits, full history in one file).

**Alternative:** BLS API v2 at `https://api.bls.gov/publicAPI/v2/timeseries/data/` (requires free registration key).

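Parsing the flat file might look like the sketch below. It assumes a tab-delimited layout with a header row, whitespace-padded fields, and period codes where `M01` to `M12` are months and `M13` is an annual average; these details should be confirmed against the file itself. The sample rows use illustrative values, and `parse_cpi` is a hypothetical helper.

```python
import csv
import io

SERIES_ID = "CUUR0000SA0"


def parse_cpi(flat_file, series_id=SERIES_ID):
    """Return monthly rows for one series, shaped like the planned raw_cpi_u table."""
    reader = csv.reader(flat_file, delimiter="\t")
    header = [name.strip() for name in next(reader)]
    rows = []
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        rec = dict(zip(header, (field.strip() for field in raw)))
        period = rec["period"]
        # Keep monthly observations only; M13 is assumed to be the annual average.
        if rec["series_id"] != series_id or not period.startswith("M") or period == "M13":
            continue
        rows.append({
            "year": int(rec["year"]),
            "month": int(period[1:]),
            "value": float(rec["value"]),
            "series_id": series_id,
        })
    return rows


# Illustrative sample in the assumed layout; values are not authoritative.
SAMPLE = io.StringIO(
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "CUUR0000SA0      \t2023\tM01\t     299.170\t\n"
    "CUUR0000SA0      \t2023\tM13\t     304.702\t\n"
    "CUUR0000SA0E     \t2023\tM01\t     280.000\t\n"
)

cpi_rows = parse_cpi(SAMPLE)
print(cpi_rows)
```
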
### Stretch: Admin Office Web Pages

First iteration targets **University of Delaware only**. Curate URLs for UD's Office of the President, Provost, VP units, and college admin pages. Use `requests` + `BeautifulSoup` to extract staff directory headcounts. Expand to 20-30 peer institutions in a later iteration.

---

## Sprint Plan

### Sprint 1 (Weeks 1-2): Foundation + IPEDS

IPEDS goes first because it provides UD's UNITID and institutional context. IPEDS downloads pull the full files, but loading is filtered to **UD's UNITID only**.

| Task | Description |
|------|-------------|
| 1.1 | Project scaffolding: `pyproject.toml`, directory structure, DuckDB setup, CLI skeleton |
| 1.2 | IPEDS Downloader: download HD complete data files for 2005-2024, parse data dictionaries |
| 1.3 | IPEDS Institution Loader: parse HD files, filter to UD's UNITID, load `raw_institution` |
| 1.4 | IPEDS Finance Parser: download and parse F1A (GASB — UD is public) finance files for UD, map year-varying variable names to canonical columns |
| 1.5 | IPEDS HR Parser: download and parse Fall Staff / Salaries survey files for UD |
| 1.6 | IPEDS Enrollment: parse EF for UD total headcount |
| 1.7 | Tests for all IPEDS parsers using fixture files |

### Sprint 2 (Weeks 3-4): IRS 990

| Task | Description |
|------|-------------|
| 2.1 | Identify UD-related EINs: University of Delaware Foundation and any other UD-affiliated nonprofit entities that file 990s |
| 2.2 | 990 Downloader: download yearly index CSVs, filter to UD-related EINs only, download matching XML files |
| 2.3 | 990 XML Parser — Filing Header: extract EIN, tax year, org name, return type, financials |
| 2.4 | 990 XML Parser — Part VII: extract per-person compensation summaries, handle XPath variations |
| 2.5 | 990 XML Parser — Schedule J: extract detailed compensation breakdowns (base, bonus, other, deferred, nontaxable, total) |
| 2.6 | Title Normalization: regex patterns + manual mapping to bucket titles into canonical roles (President, Provost, VP, Dean, CFO, etc.) |
| 2.7 | Tests with fixture XML files covering different schema years |

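Task 2.6 could start from an ordered list of regex patterns, as sketched below. The patterns, their order, and the `normalize_title` name are assumptions for illustration; titles that match nothing return `None` so they can be flagged for manual review.

```python
import re

# Ordered most-specific first: "Vice President" must be tested before "President".
ROLE_PATTERNS = [
    ("Provost", re.compile(r"\bprovost\b")),
    ("CFO", re.compile(r"\b(cfo|chief financial officer)\b")),
    ("VP", re.compile(r"\b(vp|vice[\s-]president)\b")),
    ("President", re.compile(r"\bpresident\b")),
    ("Dean", re.compile(r"\bdean\b")),
]


def normalize_title(title: str):
    """Bucket a raw 990 title into a canonical role, or None if ambiguous."""
    # Lowercase and replace punctuation/digits with spaces, keeping hyphens.
    cleaned = re.sub(r"[^a-z\s-]", " ", title.lower())
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(cleaned):
            return role
    return None  # no confident match: route to manual mapping


for raw in ["PRESIDENT", "Sr. Vice President, Finance", "Dean, College of Engineering", "Trustee"]:
    print(raw, "->", normalize_title(raw))
```

First-match-wins ordering makes combined titles such as "VP & CFO" resolve to whichever role is listed earlier, which is a design choice that the manual mapping step would need to review.
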
### Sprint 3 (Week 5): BLS CPI-U + Integration

| Task | Description |
|------|-------------|
| 3.1 | BLS CPI-U Fetcher: flat-file download of `cu.data.0.Current`, load `raw_cpi_u` |
| 3.2 | CLI wiring: `ingest ipeds`, `ingest irs990`, `ingest cpi`, `ingest all` with `--year-range` and `--force-redownload` flags |
| 3.3 | Data validation queries: row counts, NULL rates, year coverage, cross-source consistency |
| 3.4 | Data dictionary documentation |

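The checks in task 3.3 would presumably be SQL queries against DuckDB; the plain-Python sketch below only illustrates what two of them compute (NULL rates and year coverage). The function names and sample rows are hypothetical.

```python
def null_rates(rows, columns):
    """Fraction of rows in which each column is missing (None)."""
    total = len(rows)
    return {
        column: (sum(1 for row in rows if row.get(column) is None) / total if total else 0.0)
        for column in columns
    }


def missing_years(rows, first_year, last_year):
    """Years in the inclusive range that have no row at all."""
    present = {row["year"] for row in rows}
    return [year for year in range(first_year, last_year + 1) if year not in present]


# Made-up rows shaped like a raw finance table.
SAMPLE_ROWS = [
    {"year": 2020, "total_expenses": 100},
    {"year": 2022, "total_expenses": None},
]

print(len(SAMPLE_ROWS))
print(null_rates(SAMPLE_ROWS, ["total_expenses"]))
print(missing_years(SAMPLE_ROWS, 2020, 2022))
```
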
### Sprint 4 (Week 6): Stretch — Admin Page Scraper

| Task | Description |
|------|-------------|
| 4.1 | Curate UD admin office URLs: Office of the President, Provost, VP units, college dean pages |
| 4.2 | Scraper prototype: `requests` + `BeautifulSoup` for UD staff directory pages |
| 4.3 | Headcount extraction: parse pages for staff names/counts, load `raw_admin_headcount` |
| 4.4 | Document limitations and accuracy |

---

## Dependency Graph

```
Sprint 1: IPEDS ──────────────────┐
  (provides UNITID-EIN crosswalk) │
                                  ▼
Sprint 2: IRS 990 ────────────────┐
  (uses EIN crosswalk to filter)  │
                                  │
Sprint 3: BLS CPI-U (independent) │
  + CLI + Validation ─────────────┤
                                  ▼
Sprint 4: Stretch scraper ────── Phase 1 Complete
```

---

## Technical Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Database | DuckDB | Zero-config, fast analytical queries, native CSV/Parquet, single-file portable. Migrate to PostgreSQL in Phase 3. |
| Package manager | uv | Fast, modern, handles virtualenvs and lockfiles. |
| 990 XML parsing | Evaluate IRSx first; fall back to custom lxml + Master Concordance File | IRSx handles schema versioning. If unmaintained or incomplete for Schedule J, build custom. |
| IPEDS variable mapping | Map to canonical names at ingest | Keeps DB schema stable across years. Raw files stay on disk. |
| 990 download strategy | Filter to UD Foundation EIN(s) from index files | First iteration is a single institution; broader filtering deferred. |
| BLS data | Flat-file download | Simpler than API, no rate limits, single file covers full history. |
| Admin scraper | requests + BeautifulSoup | Scrapy is overkill for 20-30 targeted sites. |

---

## Database Schema

### raw_institution (from IPEDS HD, one row per institution per year)

- `unitid`, `year` (composite PK), `ein`, `institution_name`, `city`, `state`, `sector`, `control`, `carnegie_class`, `enrollment_total`

### raw_990_filing (IRS 990 header)

- `object_id` (PK), `ein`, `tax_year`, `organization_name`, `return_type`, `filing_date`, `total_revenue`, `total_expenses`, `total_assets`

### raw_990_schedule_j (one row per person per filing)

- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `base_compensation`, `bonus_compensation`, `other_compensation`, `deferred_compensation`, `nontaxable_benefits`, `total_compensation`, `compensation_from_related`

### raw_990_part_vii (one row per person per filing)

- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `avg_hours_per_week`, `reportable_comp_from_org`, `reportable_comp_from_related`, `other_compensation`

### raw_ipeds_finance (one row per institution per year)

- `unitid`, `year` (composite PK), `reporting_standard`, `total_expenses`, `instruction_expenses`, `research_expenses`, `public_service_expenses`, `academic_support_expenses`, `student_services_expenses`, `institutional_support_expenses`, `auxiliary_expenses`, `hospital_expenses`, `other_expenses`, `salaries_wages`, `benefits`

### raw_ipeds_staff

- `unitid`, `year` (composite PK), `total_staff`, `faculty_total`, `management_total`

### raw_cpi_u

- `year`, `month` (composite PK), `value`, `series_id`

### raw_admin_headcount (stretch)

- `id` (PK), `unitid`, `institution_name`, `admin_unit`, `page_url`, `scrape_date`, `staff_count`, `staff_names`

---

## Key Libraries

- `lxml` — XML parsing for 990 filings
- `duckdb` — database engine
- `httpx` or `requests` — HTTP downloads and BLS API
- `polars` or `pandas` — CSV processing for IPEDS
- `typer` or `click` — CLI framework
- `beautifulsoup4` — stretch goal scraper
- `irsx` — evaluate for 990 XML parsing
- `pytest` — testing

---

## Risks

| Risk | Mitigation |
|------|-----------|
| IRS XML schema variations break parsing | Use Master Concordance File or IRSx. Test fixtures from multiple schema years. |
| IPEDS variable names change year to year | Always parse the data dictionary alongside each file. |
| Bulk 990 download too large | First iteration is UD only — minimal download. Design for broader filtering in later iteration. |
| IRSx library unmaintained | Evaluate early in Sprint 2. Fallback: custom lxml parser with concordance mappings. |
| Title normalization unreliable | Start with high-confidence exact/regex matches. Flag ambiguous titles for manual review. |

---

## References

- [IRS Form 990 Series Downloads](https://www.irs.gov/charities-non-profits/form-990-series-downloads)
- [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/)
- [IRSx 990-xml-reader](https://github.com/jsfenfen/990-xml-reader)
- [Form 990 XML Schema Mapper](https://github.com/Giving-Tuesday/form-990-xml-mapper)
- [Schedule J Instructions](https://www.irs.gov/instructions/i990sj)
- [IPEDS Data Center](https://nces.ed.gov/ipeds/use-the-data)
- [Urban Institute IPEDS Scraper](https://github.com/UrbanInstitute/ipeds-scraper)
- [BLS API v2](https://www.bls.gov/developers/api_signature_v2.htm)
- [BLS CPI Series IDs](https://www.bls.gov/cpi/factsheets/cpi-series-ids.htm)

---

*Generated by Claude · Administrative Analytics Project*