An open-source pipeline from raw public data to policy-ready deliverables. National scale. Tract-level precision. Raw data first.
Policy Data Infrastructure is an open-source data pipeline that operates at national scale and at the county, tract, block group, and ward level from day one.
The foundation is the Madison Equity Atlas—a 22-layer GIS platform analyzing 125 census tracts in Dane County, Wisconsin. The Atlas produced the statistical methodology, the Python data acquisition core, and the evidence-card framework that generated 70 policy analyses across all 72 Wisconsin counties. PDI builds a Go orchestration layer over that core—a compiled pipeline engine that runs the same analyses faster, at any geographic scope, with a proper API and narrative generation built in.
The Atlas proved what data infrastructure can do. Five Mornings in Madison—five households, five alarm clocks, the same city—proved that when the data is structured right, it produces stories that move people. The Partnership Proposal proved it can build coalitions. The Field Guide proved it can brief decision-makers. PDI generalizes all of this to any county in the country.
Data moves from public sources to policy-ready deliverables through five stages. Each stage has a clear input contract and output format. The pipeline is a DAG—stages run in concurrent waves, bounded by parallelism settings.
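The wave-scheduling idea can be sketched with Python's standard-library `graphlib` (the production engine is described as Go; the stage names and dependency graph below are illustrative, not the real pipeline's):

```python
from graphlib import TopologicalSorter

# Illustrative stage DAG: each stage lists the stages it depends on.
stages = {
    "acquire_acs": [],
    "acquire_cdc": [],
    "process": ["acquire_acs", "acquire_cdc"],
    "analyze": ["process"],
    "visualize": ["analyze"],
    "narrate": ["analyze"],
}

def waves(dag):
    """Group stages into concurrent waves: every stage in a wave has
    all of its dependencies satisfied by earlier waves."""
    ts = TopologicalSorter(dag)
    ts.prepare()
    out = []
    while ts.is_active():
        ready = list(ts.get_ready())
        out.append(sorted(ready))
        ts.done(*ready)
    return out

# Each inner list could run concurrently, bounded by a parallelism limit.
print(waves(stages))
```

Everything in a wave is independent, so a worker pool with a bounded semaphore can execute each wave in parallel before moving to the next.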
**Acquire.** External data is pulled from public APIs into raw storage. The system knows 12 upstream sources, their rate limits, their geographic resolutions, and their update schedules.
**Process.** Raw data is cleaned, filtered to the target geography, joined to tract-level geometry via PostGIS, and output as standardized indicator records. Every value carries its GEOID, source metadata, and vintage year. Missing data is null, never a sentinel value—because a zero and an absence are two different truths.
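A minimal sketch of what a standardized indicator record might look like (the field names are illustrative, not the real schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IndicatorRecord:
    geoid: str              # 11-digit census tract GEOID
    indicator: str          # e.g. "poverty_rate"
    value: Optional[float]  # None means "no data" -- never 0 or -999
    source: str             # upstream dataset identifier
    vintage: int            # data vintage year

# A missing value stays None so downstream statistics can exclude it,
# rather than silently averaging in a fake zero.
records = [
    IndicatorRecord("55025000100", "poverty_rate", 0.12, "acs5", 2023),
    IndicatorRecord("55025000200", "poverty_rate", None, "acs5", 2023),
]
known = [r.value for r in records if r.value is not None]
print(sum(known) / len(known))  # mean over observed values only
```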
**Analyze.** The statistical engine reads processed indicators and computes derived metrics: z-score normalization, OLS regression, Blinder-Oaxaca decomposition, bootstrap confidence intervals, and piecewise tipping-point detection. The same methodology that produced the Atlas findings—now generalized to any county in the country.
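The z-score and bootstrap pieces of such an engine can be sketched with the standard library alone (this is an illustrative reimplementation, not the actual engine's API):

```python
import random
import statistics

def z_scores(xs):
    """Standardize a list of values to mean 0, stdev 1."""
    mu = statistics.fmean(xs)
    sd = statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

def bootstrap_ci(xs, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for any statistic:
    resample with replacement, recompute, take the tail percentiles."""
    rng = random.Random(seed)
    reps = sorted(stat(rng.choices(xs, k=len(xs))) for _ in range(n_boot))
    lo = reps[int(alpha / 2 * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical tract-level poverty rates.
rates = [0.08, 0.12, 0.31, 0.05, 0.22, 0.17, 0.09, 0.27]
print(z_scores(rates))
print(bootstrap_ci(rates))
```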
**Visualize.** Single-file HTML applications consume the processed data. Leaflet for choropleth maps, Alpine.js for interactivity, Chart.js for statistical displays—all self-contained, no server required.
**Deliver.** This is where the pipeline produces value. Research outputs, grant proposals, partnership materials, and narrative documents that reference real findings. A narrative engine with Go templates that can generate policy deliverables for any county in America, grounded in that county’s actual tract-level data.
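The production engine is described as using Go templates; as a language-neutral sketch of the same idea, here is template-driven narrative generation with Python's `string.Template` (all field names and values below are made up):

```python
from string import Template

narrative = Template(
    "In $county County, tract $geoid sits at percentile $pctile for "
    "economic distress; an estimated $pct% of households fall below "
    "the poverty line ($source, $vintage)."
)

# A hypothetical finding record pulled from the statistical layer.
finding = {
    "county": "Rusk", "geoid": "55107950600", "pctile": 92,
    "pct": 18.4, "source": "ACS 5-year", "vintage": 2023,
}
print(narrative.substitute(finding))
```

The key property, in either language, is that the template only ever interpolates computed findings: the prose is parameterized by data, never hand-edited per county.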
The VPS is running at pdi.trespies.dev with PostgreSQL + PostGIS. The project website is live at policydatainfrastructure.com. The pipeline runs end to end. The narrative engine renders. The statistical architecture has been refactored from the ground up.
The Madison Equity Atlas used the Neighborhood Attendance Risk Index (NARI)—a composite of 8 indicators averaged by percentile rank—to identify “priority tracts.” That approach was useful for a prototype at 125 tracts. PDI replaces it with a research-grounded statistical architecture.
Equal-weighted composites hide more than they reveal. The CDC Social Vulnerability Index—16 variables, equal weights—predicts only 38.9% of COVID case variability. Factor analysis shows 3–4 variables carry most of the variance; the other 12 are correlated noise that inflates apparent precision.
Unstandardized composites collapse to proxies. The Area Deprivation Index, when computed without standardizing variables, is 98.8% explained by just 2 variables (income + home value). A “17-variable index” that is functionally a 2-variable proxy.
Rankings are methodologically unstable. Environmental composite index rankings differ by an average of 45 places across alternative weight specifications. If rankings shift substantially under sensitivity analysis, the composite should not be presented as authoritative.
The prototype’s NARI was never tested against an outcome it didn’t contain. Its 8 indicators were selected by intuition, not factor analysis. Its tier cutoffs (80th percentile = “Critical”) were arbitrary. Copying it to national scale without re-validating against real research questions is how interpretability debt accumulates.
“Weights embed hidden normative choices disguised as technical choices.” The Commission recommended presenting composites alongside a dashboard of raw indicators—so the underlying dimensions remain visible and the composite cannot substitute for them.
The refactored architecture has five layers. Each builds on the one below. Composites exist only at the top—computed at query time, never stored as truth.
- `CompositeIndex()` with equal weights → `ValidatedFeatures()` + query-time composites
- `AssignTiers()` with arbitrary cutoffs → LISA cluster classification from actual spatial patterns
- NARI as a stored score → named factor scores + ICE as first-class indicators
- Tier badges in narratives → factor profile descriptions (“Economic Distress: 92nd percentile”)
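The "computed at query time, never stored" idea can be sketched as a function over stored validated features (names and weights below are illustrative):

```python
def composite(features, weights):
    """Compute a composite at query time from named factor scores.

    `features` holds validated per-tract factor scores; `weights` is the
    caller's explicit, visible weighting -- nothing is pre-baked or
    stored. Returns None if any required feature is missing, rather
    than silently imputing.
    """
    if any(features.get(name) is None for name in weights):
        return None
    total = sum(weights.values())
    return sum(features[name] * w for name, w in weights.items()) / total

# Hypothetical tract with two named factor scores.
tract = {"economic_distress": 1.4, "metabolic_risk": -0.3}
print(composite(tract, {"economic_distress": 0.7, "metabolic_risk": 0.3}))
print(composite({"economic_distress": 1.4},
                {"economic_distress": 0.7, "metabolic_risk": 0.3}))
```

Because the weights arrive with the query, the normative choice they embed stays visible to the caller instead of being frozen into a stored score.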
Each method below is grounded in peer-reviewed literature and tested at the 85,000-tract scale the platform targets.
**LISA (Local Indicators of Spatial Association).** Classifies each tract as High-High (concentrated disadvantage), Low-Low, High-Low (outlier), or Not Significant. The core equity atlas visual. Interpretability: 5/5.

**ICE (Index of Concentration at the Extremes).** Krieger et al. 2016. Measures polarization: (high-income white − low-income POC) / total population. Validated, directional, does not collapse race and income into a dimensionless score. Interpretability: 5/5.

**Factor analysis.** Oblimin rotation on 50+ indicators. Parallel analysis for factor count. Factors are named by loading profile, not by number. Kolak et al. found 4 factors at 72K tracts explaining 71% of variance. Interpretability: 5/5.

**Tipping-point detection.** Segmented regression identifies breakpoints: “Above 35% poverty, diabetes prevalence increases 4x faster.” County-level first, then validated at tract level. Interpretability: 5/5.

**Conditional quantiles.** For a given poverty rate, what is the 10th/50th/90th percentile of health outcomes? Identifies positive-deviance communities beating expectations—more actionable than cataloguing worst cases. Interpretability: 4/5.

**Multilevel models.** Tracts nested in counties nested in states. “27% of variation in uninsurance is between states—state Medicaid policy matters as much as local poverty.” Interpretability: 4/5.

| Platform | Open Source | National | API | Narrative | Raw-First |
|---|---|---|---|---|---|
| Census Reporter | Yes | Yes | Yes | No | Yes |
| COI 3.0 | Docs only | Yes | Download | No | Composite |
| National Equity Atlas | No | Metro/city | No | No | Dashboard |
| Opportunity Insights | Code only | Yes | Download | No | Yes |
| PolicyMap | No | Yes | No | No | Mixed |
| PDI | Yes | Yes | REST + SSE | Go templates | Yes |
No open-source platform currently offers the full stack: ingestion + statistical computation + API + narrative generation + visualization in one deployable package. PDI is the first to attempt it with a raw-data-first statistical architecture.
The purpose of PDI is to multiply knowledge and power—to strengthen policy proposals with statistics, raise awareness of issues using data, and connect achievable goals to the communities they would reach.
The vision is that this data infrastructure will produce stories and narratives backed by real facts and figures—not just charts that get filed, but documents that keep organizations in the room after the presentation is over. What Five Mornings did for Madison, PDI can do for any county in the country: turn tract-level indicators into stories that decision-makers cannot ignore.
The infrastructure exists now so that when a campaign asks “what does the data say about food access in Rusk County?” the answer is already computed, the map is already rendered, and the story is already waiting to be told.
Open-source from day one. Apache-2.0. Fork it, deploy it, extend it.
Raw data is the foundation. No unvalidated composites. Every indicator carries a reliability flag. Composites are query-time views, not stored truth.
Research-grounded methods. 30 peer-reviewed sources inform the statistical architecture. Every method traces to published validation.
Narrative generation. The only open-source platform that mechanically turns indicator data into policy-ready documents.
PDI is open to contributors, collaborators, and communities that want to build data infrastructure that serves policy, not just measures it.
Since the rough draft shipped, the infrastructure has been audited, refactored, and extended. Here is what changed.
Exploratory factor analysis on 1,265 Wisconsin tracts across 12 SDOH indicators produced two factors explaining 66.5% of variance (KMO = 0.833):
Factor 1: Mental Health / Economic Deprivation (38.4%) — poverty rate, mental health prevalence, ICE score, healthcare access. These move together across Wisconsin tracts.
Factor 2: Cardiovascular / Metabolic (28.1%) — high blood pressure, diabetes, physical health, obesity. A separate dimension that does not reduce to poverty.
This confirms the refactor decision: averaging these two dimensions into a single composite would hide the fact that they are independent. A tract can score high on economic deprivation and low on metabolic risk, or vice versa. The composite would tell you neither.
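Parallel analysis, the method used to choose the factor count, can be sketched in NumPy: keep only components whose correlation-matrix eigenvalues exceed those of equivalent random data. The data below is synthetic (two planted latent factors), not the Wisconsin indicators:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 "tracts", 8 indicators driven by 2 latent factors.
n, p, k = 500, 8, 2
latent = rng.normal(size=(n, k))
loadings = rng.normal(size=(k, p))
X = latent @ loadings + 0.5 * rng.normal(size=(n, p))

def eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

obs = eigenvalues(X)
# Baseline: eigenvalues of pure-noise data of the same shape,
# averaged over replicates.
reps = np.mean(
    [eigenvalues(rng.normal(size=(n, p))) for _ in range(50)], axis=0
)
n_factors = int(np.sum(obs > reps))
print(n_factors)  # count of components above the random baseline
```

The full pipeline would then fit that many factors with oblimin rotation (e.g. via a dedicated factor-analysis library) and name each factor from its loading profile.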
ACS table B19001 (household income by race) now provides the cross-tabulated counts needed for the Index of Concentration at the Extremes—replacing the poverty×race approximation from the initial build. 1,524 of 1,542 WI tracts (98.8%) have true ICE scores ranging from −0.65 (concentrated deprivation) to +0.82 (concentrated privilege).
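The ICE computation itself is a simple ratio over cross-tabulated counts. A sketch, with made-up counts (Krieger's race-by-income ICE takes affluent white households minus low-income households of color, over total households):

```python
def ice(privileged: int, deprived: int, total: int):
    """Index of Concentration at the Extremes: +1 = all privileged,
    -1 = all deprived, 0 = balanced. None when total is zero."""
    if total == 0:
        return None  # no households: undefined, not zero
    return (privileged - deprived) / total

# Illustrative tract: 320 high-income white households, 95 low-income
# households of color, 1,100 total households (counts are made up).
print(round(ice(320, 95, 1100), 3))  # prints 0.205
```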
The narrative rendering chain — which generates Five Mornings documents from tract data — was broken after the refactor because it still referenced the old NARI fields. Fixed: the selector, engine, and all three templates now use ICE and factor profiles. 33 tests pass.
The project website is deployed at policydatainfrastructure.com via Cloudflare Pages. It includes a Five Mornings excerpt with sourced statistics, an interactive evidence card explorer, and a six-tab methodology section explaining the statistical architecture. Every number on the site has been audited against source material.
This document is grounded in four research tracks conducted on April 14, 2026, reviewing 30 peer-reviewed and technical sources across validated composite index methodologies, disaggregated analysis methods, the open-source policy data platform landscape, and scalable spatial statistics. Full research documents and a structured reference list are available in the repository.