Yanxi Liu, University of Illinois at Chicago
Title: Information-based Optimal Subdata Selection for Clusterwise Linear Regression
Abstract: Technological advancements have accelerated in recent years, and both the amount of data being collected and the size of individual data sets are increasing exponentially. The challenge is no longer only the sheer volume of data but also its complexity: the relationship between input and output variables may no longer be homogeneous. Conventional statistical models such as generalized linear models (GLMs) may not be well suited to heterogeneous relationships. Mixture of Experts models offer a good solution. They combine several models, so they can detect heterogeneous patterns while keeping the benefits of conventional statistical modeling techniques. They do, however, require considerable computing resources, particularly when working with massive data sets. Subdata selection is one way to address this issue. Inspired by Wang, Yang, and Stufken (2019), this project develops an algorithm for clusterwise linear regression, a type of Mixture of Experts model, that selects optimal subdata from the full data set, preserving as much information as possible while requiring minimal computing resources. The proposed subdata selection is proved to be asymptotically optimal, i.e., no other method is statistically more efficient than the proposed one when the full data size is large.
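
To make the subdata idea concrete, the following is a minimal sketch of information-based selection in the spirit of Wang, Yang, and Stufken (2019) for ordinary (single-cluster) linear regression: for each covariate in turn, keep the rows with the smallest and largest values among those not yet selected. It is not the clusterwise algorithm of this talk, which the abstract does not spell out. The function name `iboss_select`, the simulated data, and the subdata size `k=600` are assumptions made only for illustration.

```python
import numpy as np

def iboss_select(X, k):
    """Return indices of about k rows chosen by extreme covariate values.

    Illustrative sketch only: plain linear regression, no clusters.
    """
    n, p = X.shape
    r = k // (2 * p)  # rows taken from each tail of each covariate
    if r < 1 or k > n:
        raise ValueError("need 2 * p <= k <= n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)           # rows not yet selected
        order = idx[np.argsort(X[idx, j], kind="stable")]
        picked = np.concatenate([order[:r], order[-r:]])  # both tails
        chosen.append(picked)
        available[picked] = False                 # exclude from later passes
    return np.concatenate(chosen)

# Hypothetical simulated data: 10,000 rows, 3 covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=10_000)

sub = iboss_select(X, k=600)

# Fit least squares on the selected subdata only (intercept + slopes).
A = np.column_stack([np.ones(len(sub)), X[sub]])
beta_hat, *_ = np.linalg.lstsq(A, y[sub], rcond=None)
```

Selecting rows with extreme covariate values tends to enlarge the determinant of the information matrix compared with a uniform random subsample of the same size, which is the sense in which such subdata is "information-based".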