Stylized facts about synthetic data for the Social Sciences – An illustration using data from the U.S. Economic Census


Speaker: Joerg Drechsler

Abstract: 

In the Social Sciences, synthetic data are typically used for confidentiality reasons. Most data used in the Social Sciences are collected through surveys based on complex sampling designs, which imply that the data cannot be treated as a simple random sample from the population of interest. Furthermore, most datasets contain dozens if not hundreds of variables with complex relationships and logical constraints that need to be preserved in the synthetic data. Finally, the user community is very diverse, ranging from political actors satisfied with simple summary statistics computed from the data to economics professors using the data for complex econometric models. All these aspects need to be taken into account when generating synthetic data for the Social Sciences.


In this talk, I will illustrate how we addressed these challenges in a project aimed at developing synthesis methodology for the U.S. Economic Census. This is joint work with Hang Kim (University of Cincinnati) and Katherine J. Thompson (U.S. Census Bureau).