Context
The question was whether low-income and rural communities experienced a disproportionate rate of drinking water violations compared to wealthy urban ones. Answering it required joining 33 years of EPA violation records with Census income data and housing density across every community water system in the United States.
What I built
I built the data pipeline from raw federal data to a publication-ready graphic.
Geocoding pipeline — Used the Google Maps API to geocode the address of every U.S. community water system (CWS). Exported the results to JSON, then ran a spatial join in QGIS to assign a county FIPS code to each system.
Income data pipeline — Loaded, merged, and cleaned decade-by-decade Census county median household income data from 1969 to 2015. Normalised all values to 2015 dollars using CPI adjustment. Filled missing inter-Census years using PCHIP interpolation (SciPy) to produce a continuous annual series.
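A minimal sketch of the adjust-then-interpolate step for a single county follows. The income figures and CPI values are placeholders chosen for illustration, and the names `cpi`, `to_2015_dollars`, and `annual` are hypothetical; only `PchipInterpolator` from SciPy is taken from the description above.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Placeholder data for one county: nominal median household income in
# the years where a Census figure exists (not real values).
census_years = np.array([1969, 1979, 1989, 1999])
nominal_income = np.array([8_000.0, 15_000.0, 26_000.0, 38_000.0])

# Placeholder CPI index values for those years and the 2015 base year.
cpi = {1969: 36.7, 1979: 72.6, 1989: 124.0, 1999: 166.6, 2015: 237.0}


def to_2015_dollars(value, year):
    """Scale a nominal dollar amount by the ratio of 2015 CPI to that year's CPI."""
    return value * cpi[2015] / cpi[year]


# Step 1: express every known value in 2015 dollars.
real_income = np.array(
    [to_2015_dollars(v, y) for v, y in zip(nominal_income, census_years)]
)

# Step 2: fit a PCHIP interpolant through the known points and evaluate
# it at every year to get a continuous annual series.
interp = PchipInterpolator(census_years, real_income)
all_years = np.arange(census_years[0], census_years[-1] + 1)
annual = interp(all_years)
```

PCHIP passes through every known point and does not overshoot between neighbouring points, which keeps interpolated incomes inside the range of the two surrounding Census values.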
Master join — Joined violation counts, county median income, and housing unit density for every CWS and every year from 1982 to 2015, producing an analysis-ready CSV of roughly 624,000 rows.
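One way such a join could be structured in pandas is sketched below on toy data. Every table, column name (`pwsid`, `county_fips`, `median_income_2015`, `housing_density`), and value here is an assumption made for illustration, not the project's actual schema.

```python
import pandas as pd

years = range(1982, 1985)  # shortened range for the toy example

# Hypothetical inputs: one row per system, one row per recorded violation,
# and one row per county per year.
systems = pd.DataFrame({"pwsid": ["A1", "B2"], "county_fips": ["25025", "50007"]})
violations = pd.DataFrame({"pwsid": ["A1", "A1", "B2"], "year": [1982, 1982, 1983]})
county_year = pd.DataFrame(
    [
        (fips, year, 50_000.0 + 100 * i, density)
        for fips, density in [("25025", 900.0), ("50007", 12.0)]
        for i, year in enumerate(years)
    ],
    columns=["county_fips", "year", "median_income_2015", "housing_density"],
)

# Build the full system-by-year grid so years with no violations still get a row.
grid = systems.merge(pd.DataFrame({"year": list(years)}), how="cross")

# Count violations per system per year.
counts = (
    violations.groupby(["pwsid", "year"]).size().rename("violations").reset_index()
)

# Attach counts (missing means zero), then the county-level covariates.
master = grid.merge(counts, on=["pwsid", "year"], how="left", validate="one_to_one")
master["violations"] = master["violations"].fillna(0).astype(int)
master = master.merge(
    county_year, on=["county_fips", "year"], how="left", validate="many_to_one"
)
```

The `validate` arguments make pandas raise an error if a key is unexpectedly duplicated, which guards against silently inflating the row count during the join.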
Demographic segmentation — Segmented water systems into four groups: rural low-income, rural high-income, suburban, and urban. Urbanicity was defined by housing density thresholds drawn from EPA and Census definitions; the low-income cutoff was 75% of the national median income, inflation-adjusted for each year. Computed average violations per system per year for each group and exported four time-series JSON files for the visualisation.
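The segmentation logic can be sketched roughly as follows. The density cutoffs below are placeholders, since the actual EPA and Census thresholds are not stated here; the field names, the function names, and the choice of a strict "less than" comparison for the income cutoff are likewise assumptions.

```python
import json
from collections import defaultdict

# Placeholder thresholds in housing units per square mile (assumed values).
RURAL_MAX_DENSITY = 16.0
URBAN_MIN_DENSITY = 64.0
INCOME_FRACTION = 0.75  # low-income cutoff as a share of the national median


def classify(density, county_income, national_median):
    """Assign a system-year to one of the four groups."""
    if density >= URBAN_MIN_DENSITY:
        return "urban"
    if density >= RURAL_MAX_DENSITY:
        return "suburban"
    if county_income < INCOME_FRACTION * national_median:
        return "rural_low_income"
    return "rural_high_income"


def group_averages(rows, national_median_by_year):
    """Average violations per system for each (group, year) pair."""
    totals = defaultdict(lambda: [0, 0])  # (group, year) -> [violations, systems]
    for row in rows:
        group = classify(
            row["housing_density"],
            row["median_income_2015"],
            national_median_by_year[row["year"]],
        )
        cell = totals[(group, row["year"])]
        cell[0] += row["violations"]
        cell[1] += 1
    series = defaultdict(list)
    for (group, year), (violations, systems) in sorted(totals.items()):
        series[group].append({"year": year, "avg_violations": violations / systems})
    return dict(series)


# Toy system-year rows (not real data).
rows = [
    {"year": 1982, "housing_density": 5.0, "median_income_2015": 30_000.0, "violations": 3},
    {"year": 1982, "housing_density": 5.0, "median_income_2015": 35_000.0, "violations": 1},
    {"year": 1982, "housing_density": 8.0, "median_income_2015": 60_000.0, "violations": 1},
    {"year": 1982, "housing_density": 30.0, "median_income_2015": 40_000.0, "violations": 0},
    {"year": 1982, "housing_density": 500.0, "median_income_2015": 70_000.0, "violations": 2},
]
series = group_averages(rows, {1982: 52_000.0})

# One JSON payload per group, keyed by a hypothetical output file name.
exports = {f"{group}.json": json.dumps(points) for group, points in series.items()}
```

Because the income cutoff is looked up per year, the same county can move between the two rural groups over time as the national median changes.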
Outcome
The figure was published in Public Health Post, a public health publication at Boston University, and in the book Pained: Uncomfortable Conversations about the Public's Health.