Add endowment spending distribution, move planning docs to private directory

- Add IPEDS F2H03C (spending distribution for current use) to endowment schema, loader, queries, and dashboard
- Endowment tab now shows spend rate alongside investment return rate
- Move planning docs to planning/ directory (gitignored)
- Update data dictionary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
emfurst 2026-04-01 07:27:57 -04:00
commit a766f6ff0d
8 changed files with 22 additions and 344 deletions

.gitignore

@@ -8,3 +8,4 @@ __pycache__/
 dist/
 build/
 .pytest_cache/
+planning/

@@ -1,118 +0,0 @@
# Administrative Analytics — Project Scope
**Version:** 0.1 | **Status:** Draft | **Date:** March 2026
---
## Problem statement
University administrative costs have grown significantly over the past two decades, yet institutions lack easy tools to benchmark their administrative spending against peer institutions or correlate those costs with performance outcomes. This project aims to close that gap using publicly available data.
---
## Objectives
Build a data pipeline and analytics dashboard that aggregates public data on university administrative costs, benchmarks institutions against peers, and surfaces correlations between administrative spend and key performance indicators such as fundraising revenue.
**First iteration scope:** Data acquisition and analysis will focus exclusively on the **University of Delaware**. Peer institution comparisons (AAU members, Carnegie peers, etc.) will be added in a later iteration.
---
## Data sources
### Primary
- **IRS Form 990** — private universities and non-profits
- **IPEDS** (Integrated Postsecondary Education Data System) — all institutions, including public universities
- **NACUBO endowment study reports**
### Secondary
- **BLS CPI-U data** — Consumer Price Index for All Urban Consumers, for inflation-adjusted compensation analysis
- University philanthropy and fundraising reports (public fact books)
- Chronicle of Higher Education data
- Public institutional fact books
### Stretch
- **Institutional administrative office web pages** — scrape Provost, President, VP unit, and college administration pages for staff directories / headcount tracking
---
## Key metrics
### Cost metrics
- Admin cost per student
- Admin-to-faculty ratio
- Administrative spending as % of total expenses
### Compensation metrics
- Key employee salaries from IRS 990 Schedule J (President, Provost, VPs, Deans, etc.)
- Year-over-year compensation growth per position
- Compensation growth vs. CPI-U (Bureau of Labor Statistics)
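The comparison of compensation growth against CPI-U can be sketched as a single ratio. This is a minimal illustration, assuming compensation and CPI-U values are already available for a base year and a later year; the function name `real_growth` and its parameters are hypothetical and not part of the project code.

```python
def real_growth(comp_start: float, comp_end: float,
                cpi_start: float, cpi_end: float) -> float:
    """Return inflation-adjusted compensation growth as a fraction.

    Hypothetical helper: divides nominal growth by CPI-U growth over
    the same period, so 0.10 means pay grew 10% faster than prices.
    """
    if comp_start <= 0 or cpi_start <= 0:
        raise ValueError("base-year values must be positive")
    nominal_ratio = comp_end / comp_start
    inflation_ratio = cpi_end / cpi_start
    return nominal_ratio / inflation_ratio - 1.0
```

A result near zero means compensation roughly tracked inflation; a positive result means it outpaced CPI-U.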
### Performance metrics
- Philanthropic revenue raised
- Endowment growth year-over-year
- Grant funding secured
### Benchmarking (later iteration)
- Peer institution comparisons by Carnegie classification, size, and public/private status
- AAU institution comparisons
### Trends
- Year-over-year cost and performance trajectories (5-10 year views)
---
## Phases
### Phase 1 — Data acquisition
Build parsers for IRS 990 filings (including Schedule J key employee compensation) and IPEDS data for the **University of Delaware** only. Ingest BLS CPI-U series for inflation benchmarking. Establish a raw data store. **Stretch:** prototype scraper for UD administrative office web pages to track headcount.
### Phase 2 — Data pipeline & normalization
Clean, normalize, and reconcile data across sources. Define a unified schema. Build institution-matching logic to link records across datasets using IPEDS Unit ID as the canonical identifier.
### Phase 3 — Internal analytics dashboard
Build an internal tool for our institution to explore cost and performance data. Validate findings with stakeholders.
### Phase 4 — Multi-institution expansion
Extend data acquisition to peer institutions (AAU members, Carnegie peers, etc.). Add benchmarking comparisons, configurable peer groups, and export features.
---
## Technical approach
### Data collection
- Python scraping with Scrapy / BeautifulSoup for supplementary sources
- IRS 990 XML bulk data parser (IRS provides annual bulk downloads)
- IPEDS bulk data files (CSV downloads, no scraping required)
### Data storage & transformation
- PostgreSQL or DuckDB as primary data store
- dbt for data transformation and modeling
### Frontend & API
- React with a charting library (e.g., Recharts or Observable Plot)
- REST or GraphQL API layer
---
## Risks & considerations
| Severity | Risk | Notes |
|----------|------|-------|
| Medium | Data completeness | Public universities do not file 990s. IPEDS is the primary fallback and provides expense breakdowns by function. |
| Medium | Institution matching | Names vary across datasets. Use IPEDS Unit ID as the canonical identifier from the start. |
| Low | Rate limiting | IRS and IPEDS data are available as bulk downloads; scraping is mostly not required for core datasets. |
| Low | Data licensing | All target sources are public domain or open government data. No licensing barriers anticipated. |
---
## Out of scope (v1)
- Internal financial systems integration
- Real-time data feeds
- Accreditation or ranking data
- Faculty / HR compensation analysis (key executive compensation from 990s *is* in scope)
---
*Generated by Claude · Administrative Analytics Project*

@@ -162,6 +162,7 @@ Raw data layer for University of Delaware administrative analytics. All tables a
 | endowment_eoy | BIGINT | Endowment value, end of fiscal year | F2H02 |
 | new_gifts | BIGINT | New gifts and additions to endowment | F2H03A |
 | net_investment_return | BIGINT | Net investment return on endowment | F2H03B |
+| spending_distribution | BIGINT | Spending distribution for current use (negative) | F2H03C |
 | other_changes | BIGINT | Other changes in endowment value | F2H03D |
 | total_private_gifts | BIGINT | Total private gifts, grants, and contracts | F2D08 |
 | total_investment_return | BIGINT | Total investment return (all funds) | F2D10 |

@@ -1,220 +0,0 @@
# Phase 1 — Data Acquisition: Implementation Plan
**Version:** 0.1 | **Status:** Draft | **Date:** March 2026
---
## Summary
Phase 1 builds the data ingestion layer for three primary sources (IRS 990 XML, IPEDS CSV, BLS CPI-U) and a stretch-goal web scraper, scoped to the **University of Delaware** only. Peer/AAU/multi-institution comparisons are deferred to a later iteration. The deliverable is a populated DuckDB database with UD's raw data ready for Phase 2 normalization.
---
## Data Source Details
### IRS 990 Bulk XML
**Location:** `https://apps.irs.gov/pub/epostcard/990/xml/{YEAR}/`
**Key extractions:**
- Filing header: EIN, tax year, org name, total revenue/expenses/assets
- Part VII: per-person compensation summary (name, title, hours, reportable comp)
- **Schedule J**: detailed compensation breakdown per person — base, bonus, other, deferred, nontaxable benefits, total
**Schema considerations:**
- XML element names vary across tax years. The [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/) maps field variations.
- [IRSx (990-xml-reader)](https://github.com/jsfenfen/990-xml-reader) handles versioned XPath differences; evaluate it first, and fall back to a custom lxml parser if needed.
- Only private/nonprofit universities file 990s. UD is a public university and does **not** file a 990. However, the University of Delaware Foundation (a separate nonprofit) does file a 990 — this is the source for UD executive compensation and philanthropic data.
- For the first iteration, filter to UD/UD Foundation EINs only. Broader higher-ed filtering (NTEE codes B40-B43, EIN crosswalk) is deferred to the multi-institution iteration.
### IPEDS Bulk CSV
**Location:** `https://nces.ed.gov/ipeds/datacenter/` — Complete Data Files
**Key survey components:**
| Component | What it provides |
|-----------|-----------------|
| HD (Directory) | UNITID, name, EIN, Carnegie classification, sector, control |
| F1A / F2 (Finance) | Expenses by function — instruction, research, academic support, **institutional support (admin)**, etc. |
| S / SAL (HR) | Staff counts by occupational category, faculty counts, salary outlays |
| EF (Enrollment) | Student headcounts for per-student cost calculations |
**Note:** Variable names change across years. Always parse the accompanying data dictionary; never hardcode variable names.
### BLS CPI-U
**Series ID:** `CUUR0000SA0` — CPI-U, Not Seasonally Adjusted, U.S. City Average, All Items
**Preferred method:** Flat-file download from `https://download.bls.gov/pub/time.series/cu/cu.data.0.Current` (simpler than API, no rate limits, full history in one file).
**Alternative:** BLS API v2 at `https://api.bls.gov/publicAPI/v2/timeseries/data/` (requires free registration key).
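A minimal sketch of parsing the flat file once it has been downloaded. It assumes the file is tab-delimited with whitespace-padded fields and a header of `series_id`, `year`, `period`, `value`, `footnote_codes`, and that period `M13` denotes the annual average; verify both against the actual file. The function name `parse_cpi_rows` is hypothetical, and the sample values are made up for illustration, not real CPI figures.

```python
import csv
import io

# Illustrative sample only; values are NOT real CPI-U data.
SAMPLE = (
    "series_id\tyear\tperiod\tvalue\tfootnote_codes\n"
    "CUUR0000SA0      \t2020\tM01\t     100.000\t\n"
    "CUUR0000SA0      \t2020\tM02\t     100.500\t\n"
    "CUUR0000SA0      \t2020\tM13\t     100.250\t\n"
    "CUUR0000SA1      \t2020\tM01\t      99.000\t\n"
)


def parse_cpi_rows(text: str, series_id: str = "CUUR0000SA0"):
    """Yield (year, month, value) for the monthly rows of one series."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [name.strip() for name in next(reader)]
    for raw in reader:
        # Fields are padded with spaces, so strip each one.
        row = dict(zip(header, (field.strip() for field in raw)))
        if row.get("series_id") != series_id:
            continue
        period = row.get("period", "")
        # Keep M01..M12; skip M13 (assumed annual average) and non-monthly periods.
        if not period.startswith("M") or period == "M13":
            continue
        yield int(row["year"]), int(period[1:]), float(row["value"])


rows = list(parse_cpi_rows(SAMPLE))
```

The resulting tuples line up with the `year`, `month`, and `value` columns of `raw_cpi_u`.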
### Stretch: Admin Office Web Pages
First iteration targets **University of Delaware only**. Curate URLs for UD's Office of the President, Provost, VP units, and college admin pages. Use `requests` + `BeautifulSoup` to extract staff directory headcounts. Expand to 20-30 peer institutions in a later iteration.
---
## Sprint Plan
### Sprint 1 (Weeks 1-2): Foundation + IPEDS
IPEDS goes first — it provides UD's UNITID and institutional context. All IPEDS downloads can pull full files but loading is filtered to **UD's UNITID only**.
| Task | Description |
|------|-------------|
| 1.1 | Project scaffolding: `pyproject.toml`, directory structure, DuckDB setup, CLI skeleton |
| 1.2 | IPEDS Downloader: download HD complete data files for 2005-2024, parse data dictionaries |
| 1.3 | IPEDS Institution Loader: parse HD files, filter to UD's UNITID, load `raw_institution` |
| 1.4 | IPEDS Finance Parser: download and parse F1A (GASB — UD is public) finance files for UD, map year-varying variable names to canonical columns |
| 1.5 | IPEDS HR Parser: download and parse Fall Staff / Salaries survey files for UD |
| 1.6 | IPEDS Enrollment: parse EF for UD total headcount |
| 1.7 | Tests for all IPEDS parsers using fixture files |
### Sprint 2 (Weeks 3-4): IRS 990
| Task | Description |
|------|-------------|
| 2.1 | Identify UD-related EINs: University of Delaware Foundation and any other UD-affiliated nonprofit entities that file 990s |
| 2.2 | 990 Downloader: download yearly index CSVs, filter to UD-related EINs only, download matching XML files |
| 2.3 | 990 XML Parser — Filing Header: extract EIN, tax year, org name, return type, financials |
| 2.4 | 990 XML Parser — Part VII: extract per-person compensation summaries, handle XPath variations |
| 2.5 | 990 XML Parser — Schedule J: extract detailed compensation breakdowns (base, bonus, other, deferred, nontaxable, total) |
| 2.6 | Title Normalization: regex patterns + manual mapping to bucket titles into canonical roles (President, Provost, VP, Dean, CFO, etc.) |
| 2.7 | Tests with fixture XML files covering different schema years |
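Task 2.6 could start from an ordered list of regular expressions, as in the sketch below. The patterns, role labels, and the name `normalize_title` are assumptions for illustration; the real mapping would need the manual review step the plan describes. Order matters: "Vice President" has to be tested before "President".

```python
import re
from typing import Optional

# Hypothetical pattern list; more specific roles come first.
ROLE_PATTERNS = [
    ("CFO", re.compile(r"\b(chief financial officer|cfo)\b", re.I)),
    ("VP", re.compile(r"\b(vice president|vp)\b", re.I)),
    ("Provost", re.compile(r"\bprovost\b", re.I)),
    ("Dean", re.compile(r"\bdean\b", re.I)),
    ("President", re.compile(r"\bpresident\b", re.I)),
]


def normalize_title(title: str) -> Optional[str]:
    """Return the first matching canonical role, or None if unmatched."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(title):
            return role
    # No confident match: leave for manual review.
    return None
```

Titles that return `None` are the ones to flag for manual mapping rather than guess.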
### Sprint 3 (Week 5): BLS CPI-U + Integration
| Task | Description |
|------|-------------|
| 3.1 | BLS CPI-U Fetcher: flat-file download of `cu.data.0.Current`, load `raw_cpi_u` |
| 3.2 | CLI wiring: `ingest ipeds`, `ingest irs990`, `ingest cpi`, `ingest all` with `--year-range` and `--force-redownload` flags |
| 3.3 | Data validation queries: row counts, NULL rates, year coverage, cross-source consistency |
| 3.4 | Data dictionary documentation |
### Sprint 4 (Week 6): Stretch — Admin Page Scraper
| Task | Description |
|------|-------------|
| 4.1 | Curate UD admin office URLs: Office of the President, Provost, VP units, college dean pages |
| 4.2 | Scraper prototype: `requests` + `BeautifulSoup` for UD staff directory pages |
| 4.3 | Headcount extraction: parse pages for staff names/counts, load `raw_admin_headcount` |
| 4.4 | Document limitations and accuracy |
---
## Dependency Graph
```
Sprint 1: IPEDS ──────────────────┐
(provides UNITID-EIN crosswalk) │
Sprint 2: IRS 990 ────────────────┐
(uses EIN crosswalk to filter) │
Sprint 3: BLS CPI-U (independent) │
+ CLI + Validation ─────────────┤
Sprint 4: Stretch scraper ────── Phase 1 Complete
Phase 2: SKIPPED (folded into dashboard queries)
Sprint 5: Dash dashboard ──────── Phase 3 Prototype
```
---
## Phase 2 Skip Decision (March 2026)
Phase 2 (data pipeline & normalization) was **skipped** for the initial prototype. All derived metrics — admin cost ratios, CPI adjustments, compensation growth indices — are computed directly in dashboard SQL queries rather than a separate normalized schema.
**Rationale:** With a single institution (UD) and a populated DuckDB, the query layer is sufficient for a local prototype. A proper Phase 2 with dbt transformations and a unified analytical schema should be built before:
- Expanding to multi-institution comparisons (Phase 4)
- Moving to a production React dashboard (Phase 3 full build)
- Adding complex cross-source joins that benefit from materialized views
---
## Technical Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Database | DuckDB | Zero-config, fast analytical queries, native CSV/Parquet, single-file portable. Migrate to PostgreSQL in Phase 3. |
| Package manager | uv | Fast, modern, handles virtualenvs and lockfiles. |
| 990 XML parsing | Evaluate IRSx first, fall back to custom lxml + Master Concordance File | IRSx handles schema versioning. If unmaintained or incomplete for Schedule J, build custom. |
| IPEDS variable mapping | Map to canonical names at ingest | Keeps DB schema stable across years. Raw files stay on disk. |
| 990 download strategy | Filter to UD Foundation EIN(s) from index files | First iteration is a single institution; broader filtering deferred. |
| BLS data | Flat-file download | Simpler than API, no rate limits, single file covers full history. |
| Admin scraper | requests + BeautifulSoup | Scrapy is overkill for 20-30 targeted sites. |
---
## Database Schema
### raw_institution (from IPEDS HD)
- `unitid` (PK), `ein`, `institution_name`, `city`, `state`, `sector`, `control`, `carnegie_class`, `enrollment_total`, `year`
### raw_990_filing (IRS 990 header)
- `object_id` (PK), `ein`, `tax_year`, `organization_name`, `return_type`, `filing_date`, `total_revenue`, `total_expenses`, `total_assets`
### raw_990_schedule_j (one row per person per filing)
- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `base_compensation`, `bonus_compensation`, `other_compensation`, `deferred_compensation`, `nontaxable_benefits`, `total_compensation`, `compensation_from_related`
### raw_990_part_vii (one row per person per filing)
- `id` (PK), `object_id` (FK), `ein`, `tax_year`, `person_name`, `title`, `avg_hours_per_week`, `reportable_comp_from_org`, `reportable_comp_from_related`, `other_compensation`
### raw_ipeds_finance (one row per institution per year)
- `unitid`, `year` (composite PK), `reporting_standard`, `total_expenses`, `instruction_expenses`, `research_expenses`, `public_service_expenses`, `academic_support_expenses`, `student_services_expenses`, `institutional_support_expenses`, `auxiliary_expenses`, `hospital_expenses`, `other_expenses`, `salaries_wages`, `benefits`
### raw_ipeds_staff
- `unitid`, `year` (composite PK), `total_staff`, `faculty_total`, `management_total`
### raw_cpi_u
- `year`, `month` (composite PK), `value`, `series_id`
### raw_admin_headcount (stretch)
- `id` (PK), `unitid`, `institution_name`, `admin_unit`, `page_url`, `scrape_date`, `staff_count`, `staff_names`
---
## Key Libraries
- `lxml` — XML parsing for 990 filings
- `duckdb` — database engine
- `httpx` or `requests` — HTTP downloads and BLS API
- `polars` or `pandas` — CSV processing for IPEDS
- `typer` or `click` — CLI framework
- `beautifulsoup4` — stretch goal scraper
- `irsx` — evaluate for 990 XML parsing
- `pytest` — testing
---
## Risks
| Risk | Mitigation |
|------|-----------|
| IRS XML schema variations break parsing | Use Master Concordance File or IRSx. Test fixtures from multiple schema years. |
| IPEDS variable names change year to year | Always parse the data dictionary alongside each file. |
| Bulk 990 download too large | First iteration is UD only — minimal download. Design for broader filtering in later iteration. |
| IRSx library unmaintained | Evaluate early in Sprint 2. Fallback: custom lxml parser with concordance mappings. |
| Title normalization unreliable | Start with high-confidence exact/regex matches. Flag ambiguous titles for manual review. |
---
## References
- [IRS Form 990 Series Downloads](https://www.irs.gov/charities-non-profits/form-990-series-downloads)
- [Master Concordance File](https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/)
- [IRSx 990-xml-reader](https://github.com/jsfenfen/990-xml-reader)
- [Form 990 XML Schema Mapper](https://github.com/Giving-Tuesday/form-990-xml-mapper)
- [Schedule J Instructions](https://www.irs.gov/instructions/i990sj)
- [IPEDS Data Center](https://nces.ed.gov/ipeds/use-the-data)
- [Urban Institute IPEDS Scraper](https://github.com/UrbanInstitute/ipeds-scraper)
- [BLS API v2](https://www.bls.gov/developers/api_signature_v2.htm)
- [BLS CPI Series IDs](https://www.bls.gov/cpi/factsheets/cpi-series-ids.htm)
---
*Generated by Claude · Administrative Analytics Project*

@@ -109,9 +109,14 @@ def layout(conn: duckdb.DuckDBPyConnection):
     ))
     components_fig.add_trace(go.Bar(
         x=pd["year"], y=pd["new_gifts"] / 1e6,
-        name="New Gifts to Endowment",
+        name="New Gifts",
         marker_color="#00539F",
     ))
+    components_fig.add_trace(go.Bar(
+        x=pd["year"], y=pd["spending_distribution"] / 1e6,
+        name="Spending Distribution",
+        marker_color="#E07A5F",
+    ))
     if "other_changes" in pd.columns:
         components_fig.add_trace(go.Bar(
             x=pd["year"], y=pd["other_changes"] / 1e6,
@@ -125,18 +130,24 @@ def layout(conn: duckdb.DuckDBPyConnection):
         template="plotly_white", height=400,
     )
-    # Investment return rate
+    # Investment return rate and spend rate
     rate_fig = go.Figure()
     rates = pd.copy()
     rates["return_pct"] = rates["net_investment_return"] * 100 / rates["endowment_boy"]
+    rates["spend_pct"] = rates["spending_distribution"].abs() * 100 / rates["endowment_boy"]
     rate_fig.add_trace(go.Scatter(
         x=rates["year"], y=rates["return_pct"],
-        mode="lines+markers", name="Return %",
+        mode="lines+markers", name="Investment Return %",
         line={"color": "#00539F"},
     ))
+    rate_fig.add_trace(go.Scatter(
+        x=rates["year"], y=rates["spend_pct"],
+        mode="lines+markers", name="Spend Rate %",
+        line={"color": "#E07A5F"},
+    ))
     rate_fig.add_hline(y=0, line_dash="dot", line_color="#ccc")
     rate_fig.update_layout(
-        title="Endowment Net Investment Return Rate (%)",
+        title="Endowment Investment Return vs Spend Rate (%)",
         xaxis_title="Year", yaxis_title="%",
         template="plotly_white", height=380,
     )
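For reference, the two rates in the hunk above reduce to simple arithmetic on the beginning-of-year endowment value. The sketch below restates that calculation in plain Python rather than the DataFrame used in the dashboard; the function name `endowment_rates` and the sample figures are illustrative assumptions, not values from the University of Delaware data. Because F2H03C is stored as a negative number, the spend rate takes the absolute value.

```python
def endowment_rates(rows):
    """Return per-year return and spend rates as percentages of BOY value."""
    out = []
    for row in rows:
        boy = row["endowment_boy"]
        if not boy:
            # Skip years with a missing or zero beginning-of-year value.
            continue
        out.append({
            "year": row["year"],
            "return_pct": row["net_investment_return"] * 100 / boy,
            # spending_distribution is negative in IPEDS, so use abs().
            "spend_pct": abs(row["spending_distribution"]) * 100 / boy,
        })
    return out


# Made-up example row for illustration only.
example = [{
    "year": 2020,
    "endowment_boy": 1_000_000_000,
    "net_investment_return": 80_000_000,
    "spending_distribution": -45_000_000,
}]
rates = endowment_rates(example)
```

With the example row, the return rate is 8.0% and the spend rate is 4.5% of the beginning-of-year value.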

@@ -379,7 +379,8 @@ def query_endowment(conn: duckdb.DuckDBPyConnection) -> pl.DataFrame:
     """Endowment performance over time."""
     return conn.execute("""
         SELECT year, endowment_boy, endowment_eoy, new_gifts,
-               net_investment_return, other_changes, long_term_investments
+               net_investment_return, spending_distribution, other_changes,
+               long_term_investments
         FROM raw_ipeds_endowment
         WHERE unitid = ?
         ORDER BY year

@@ -108,6 +108,7 @@ TABLES = {
         endowment_eoy BIGINT,
         new_gifts BIGINT,
         net_investment_return BIGINT,
+        spending_distribution BIGINT,
         other_changes BIGINT,
         total_private_gifts BIGINT,
         total_investment_return BIGINT,

@@ -47,6 +47,7 @@ F2_ENDOWMENT_VARIANTS = {
     "endowment_eoy": ["F2H02"],
     "new_gifts": ["F2H03A"],
     "net_investment_return": ["F2H03B"],
+    "spending_distribution": ["F2H03C"],
     "other_changes": ["F2H03D"],
     "total_private_gifts": ["F2D08"],
     "total_investment_return": ["F2D10"],
@@ -55,7 +56,7 @@ F2_ENDOWMENT_VARIANTS = {
 ENDOWMENT_COLUMNS = [
     "unitid", "year", "endowment_boy", "endowment_eoy", "new_gifts",
-    "net_investment_return", "other_changes", "total_private_gifts",
+    "net_investment_return", "spending_distribution", "other_changes", "total_private_gifts",
     "total_investment_return", "long_term_investments",
 ]
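The diff does not show how the loader consumes `F2_ENDOWMENT_VARIANTS`, so the following is only a sketch of one plausible approach: for each canonical column, take the first variant name present in a raw IPEDS row. The function `map_row` and the trimmed `VARIANTS` dict are hypothetical stand-ins, not the project's code.

```python
from typing import Optional

# Trimmed copy of the mapping, for illustration only.
VARIANTS = {
    "new_gifts": ["F2H03A"],
    "net_investment_return": ["F2H03B"],
    "spending_distribution": ["F2H03C"],
    "other_changes": ["F2H03D"],
}


def map_row(raw: dict, variants: dict) -> dict:
    """Map raw IPEDS variable names to canonical column names."""
    mapped: dict[str, Optional[int]] = {}
    for canonical, names in variants.items():
        # Use the first variant with a non-empty value; otherwise None.
        mapped[canonical] = next(
            (int(raw[name]) for name in names if raw.get(name) not in (None, "")),
            None,
        )
    return mapped


mapped = map_row({"F2H03A": "10", "F2H03C": "-5", "F2H03D": ""}, VARIANTS)
```

Missing or blank source variables come through as `None`, which keeps the table schema stable in years where a variable is absent.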