Synthetic US Population with PopGen2

To simulate US travel (as part of Texas Department of Transportation research project 0-7081), UT Austin post-doctoral fellow Dr. Yantao Huang created a 10% synthetic population for Dr. Kara Kockelman’s research team. The data are resolved to the nation’s 73,056 census tracts, because the team’s model equations rely on land use variables at that level of detail. The team used Pendyala’s PopGen 2.0 software, but had to run this massive synthesis on the Texas Advanced Computing Center’s supercomputers over several days. To spare others from duplicating that work, we are sharing the results here for you to use! Please have a look, and let us know (yantao.huang@anl.gov & kkockelm@mail.utexas.edu) if you have any suggestions. Thanks, and happy travel demand modeling! (Or land use forecasting, emissions estimation, crash forecasting, etc.)

Introduction

This website provides a 10% synthetic sample of US households and persons at the census tract level. The data were synthesized with the PopGen 2.0 software, developed by Pendyala et al. (2011) and Ye et al. (2009), using marginals from the 2019 5-year American Community Survey (ACS).

The household and person data were synthesized across 2,351 “Public Use Microdata Areas” (PUMAs) to mimic the population of the whole US (the 50 states and the District of Columbia), consistent with census datasets and geographic-correspondence files. Control margins for household income and household size, and for persons’ gender, age, race, and education, were scraped from the Census Bureau and processed as input for PopGen. The output is a 10% sample of US households and persons in which the household and person records are matched to each other and to the control margins.
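As a rough illustration of what matching the control margins means for a 10% sample, the sketch below tallies synthetic households by tract and category, scales the counts by 10, and compares them with tract-level control totals. The column names (`tract`, `hhsize_cat`), the CSV layout, and all numbers are assumptions made up for illustration; they are not the actual schema or values of the files shared here.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the synthetic household file; the real column names
# and file format may differ (these are assumptions for illustration).
SYNTHETIC_HOUSEHOLDS = """tract,hhsize_cat
48453000101,1
48453000101,1
48453000101,2
48453000102,2
"""

# Hypothetical control totals: (tract, household-size category) -> households.
CONTROL_MARGINS = {
    ("48453000101", "1"): 21,
    ("48453000101", "2"): 9,
    ("48453000102", "2"): 12,
}

SCALE = 10  # each record in a 10% sample stands for about 10 households


def scaled_counts(csv_text, scale=SCALE):
    """Count synthetic households per (tract, category) and expand to full scale."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[(row["tract"], row["hhsize_cat"])] += 1
    return {key: n * scale for key, n in counts.items()}


def margin_gaps(synthetic, margins):
    """Return synthetic-minus-control differences for every control cell."""
    return {key: synthetic.get(key, 0) - target for key, target in margins.items()}


gaps = margin_gaps(scaled_counts(SYNTHETIC_HOUSEHOLDS), CONTROL_MARGINS)
for (tract, category), gap in sorted(gaps.items()):
    print(tract, category, gap)
```

Because a 10% sample can only represent counts in steps of about 10, small gaps like these are to be expected in individual tract cells even when the synthesis fits well overall.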

Data

The main inputs to the PopGen 2.0 model come from the ACS and the Public Use Microdata Sample (PUMS). Person and household margins from the ACS at the census tract level serve as controls, with detailed categories shown in Table 1. The household and person samples are extracted from PUMS.

Categories of Margin Control Variables 

Household and Person Files

Synthetic household file (698MB)

Synthetic person file (1.7GB)

Public Use Microdata Sample

The 2019 PUMS data and data dictionary were extracted from the US Census Bureau FTP site. The household IDs and person IDs correspond to the IDs in the synthetic files.
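Because the IDs are shared, attributes that are not carried in the synthetic files can be looked up in the PUMS samples. A minimal sketch of that lookup follows. The ID column name (`hid`), the CSV layout, and the toy values are assumptions for illustration; `HINCP` (household income) and `VEH` (vehicles available) are used only as example PUMS fields, so please check the data dictionary for the fields you need.

```python
import csv
import io

# Toy stand-ins for the two files; column names are assumptions.
PUMS_HOUSEHOLDS = """hid,HINCP,VEH
1001,52000,2
1002,118000,1
"""

SYNTHETIC_HOUSEHOLDS = """tract,hid
48453000101,1001
48453000101,1001
48453000102,1002
"""


def index_by_id(csv_text, id_column):
    """Map each sample ID to its full record."""
    return {row[id_column]: row for row in csv.DictReader(io.StringIO(csv_text))}


def attach_pums(synthetic_text, pums_index, id_column, fields):
    """Copy the requested PUMS fields onto each synthetic household row."""
    joined = []
    for row in csv.DictReader(io.StringIO(synthetic_text)):
        sample = pums_index[row[id_column]]
        joined.append({**row, **{f: sample[f] for f in fields}})
    return joined


pums_index = index_by_id(PUMS_HOUSEHOLDS, "hid")
households = attach_pums(SYNTHETIC_HOUSEHOLDS, pums_index, "hid", ["HINCP", "VEH"])
for household in households:
    print(household)
```

In this toy input the same sample household appears twice, so the lookup is many-to-one: several synthetic rows can point to a single PUMS record. For the full-size files, a streaming or database-backed join would likely be more practical than holding an index in memory.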

2019 PUMS - household samples (part A, 442MB)

2019 PUMS - household samples (part B, 425MB)

2019 PUMS - person samples (part A, 1.08GB)

2019 PUMS - person samples (part B, 1.03GB)

2019 PUMS data dictionary

Census Tract GEOID and geo ID matching
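When working with the matching file, it may help to recall that a census tract GEOID is an 11-digit code made of a 2-digit state FIPS code, a 3-digit county code, and a 6-digit tract code. The helper below is a small sketch (the function name is hypothetical) that splits a GEOID into those parts and restores a leading zero that can be lost when the code is read as an integer.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit census tract GEOID into state, county, and tract codes."""
    geoid = str(geoid).zfill(11)  # restore a leading zero lost to integer parsing
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"not an 11-digit tract GEOID: {geoid!r}")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}


# A GEOID read as an integer, so the leading "0" of state code "06" is missing.
parts = split_tract_geoid(6001400100)
print(parts)
```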

References

Pendyala, R.M., K.C. Konduri, and K.P. Christian (2011) PopGen 1.1 User’s Manual. Lulu.com Publishers, USA.

Ye, X., Konduri, K., Pendyala, R. M., Sana, B., & Waddell, P. (2009, January). A methodology to match distributions of both household and person attributes in the generation of synthetic populations. In 88th Annual Meeting of the Transportation Research Board, Washington, DC.

U.S. Census Bureau (2022). Public Use Microdata Sample (PUMS). Retrieved from: https://www.census.gov/programs-surveys/acs/microdata.html

U.S. Census Bureau (2022). American Community Survey Data. Retrieved from: https://www.census.gov/programs-surveys/acs/data.html

Acknowledgment

The synthetic data were originally used for the TxDOT 0-7081 project “Understanding the Impacts of Autonomous Vehicles on Long-distance Travel Choices across Texas”. The authors also thank Dr. Ram Pendyala and his team at Arizona State University, and the Texas Advanced Computing Center, for their support.

Contact

Please reach out to Yantao Huang (yantao.huang@anl.gov) and Kara Kockelman (kkockelm@mail.utexas.edu) for questions about the synthetic person and household files.