Data Minding for Synthetic Data: Have We Ever Had Real Data?
Speaker: Xiao-Li Meng
Abstract:
Data minding refers to a thorough examination process to ensure data quality, an essential step before data mining or analysis. This process should encompass inquiries into many components of the data life cycle, starting with data conceptualization. This includes understanding the intention of data creation, which is often underappreciated, and its implied data resolution, which is rarely emphasized. The term 'data creation' should remind us that data are human constructs, despite the commonly held belief in 'raw data' and 'letting the data speak.' Understanding the implied data resolution provides a sagacious starting point for creating data that strike a balance among descriptive, ascriptive, and prescriptive variations, all of which are context- and resolution-dependent.
This talk invites the audience to take a break from data mining to join a data-minding excursion, starting with studying a tour map from Harvard Data Science Review for a behind-the-scenes look at some 'real data' factories. Once their perspectives are sufficiently stretched, the audience will participate in a 'doubly synthetic' experiment in progress, where multiple imputation is used to deal with privacy-induced spatial dislocations in ground-level survey data before being combined with satellite data via deep learning for assessing living conditions in all communities in Africa. (This experiment is part of a forthcoming article on “Statistical preprocessing for privacy-induced spatial mismatch: A multiple imputation approach with deep learning” by Kakooei, Bailie, Daoud, and Meng.)
Through this excursion, we hope to provide a useful lens for appreciating data by their revelatory and obfuscatory variations. We will also remind ourselves to avoid creating synthetic data with predatory variations, and to embrace a motivational mantra: be congenial in life, as you can hardly find it in synthetic data.