Synthesizing Tax Data with tidysynthesis


Speaker: Aaron Williams

Abstract: 

Society benefits when leaders have access to high-quality data to make evidence-based decisions, but growing privacy concerns hamper decisionmakers’ ability to understand and improve the world. Fully synthetic data present an opportunity to learn from administrative data while minimizing disclosure risks. But many administrative data sets are difficult to synthesize because they come from complex surveys, contain many variables, and involve complex relationships between those variables.


In this talk, I will share how the Urban Institute collaborates with the Statistics of Income Division at the IRS to create fully synthetic data for tax policy research. Motivated by the significant challenges of synthesizing data, we built an R package called tidysynthesis to create machine learning models for each variable in the data. tidysynthesis leverages the power of tidymodels and allows users to run sequences of machine learning models with different recipes, engines, and samplers while adding noise and enforcing logical constraints.