Session V (May 16, 8:30am-10:00am): Orthogonal arrays and related designs, organized by Boxin Tang
Title: Active Sampling for High-dimensional Ridge Estimator with Application in Genome-wide Association Studies
Speaker: Lin Wang, Purdue University
Abstract: Even when extensive data sets are available, measurement constraints often make it impractical to collect labels for every data point. Subsampling approaches select a subset of design points from a large pool, which can substantially reduce experimental costs. However, existing subsampling methods are designed primarily for low-dimensional data or assume that only a sparse set of covariates is significant. We propose a computationally tractable sampling method that selects a small subset from a large data set without assuming sparsity; the number of significant covariates may be as large as, or even larger than, the sample size of the full data set. We focus on ridge regression, for which we develop sampling probabilities that minimize the mean squared predictive risk on the full data set. Theoretical analysis and extensive simulations show that the approach outperforms existing subsampling methods on high-dimensional data with many significant covariates. We also illustrate the approach through an application to genome-wide association studies.
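
For orientation only, the general workflow the abstract describes (draw a small subset according to sampling probabilities, collect labels for that subset alone, then fit a ridge estimator) might be sketched as follows. This is a minimal illustration and not the speaker's method: the abstract does not state the optimal sampling probabilities, so the sketch takes `probs` as an input and the usage note below uses uniform probabilities as a placeholder. The names `ridge_fit`, `subsample_ridge`, `label_fn`, and `lam` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimator: solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def subsample_ridge(X, label_fn, probs, k, lam, rng):
    """Draw k rows without replacement using probs, label them, fit ridge.

    label_fn(idx) stands in for the costly measurement step: it returns
    labels only for the selected rows (hypothetical interface).
    probs is a placeholder; the talk's optimal probabilities are not
    given in the abstract.
    """
    n = X.shape[0]
    idx = rng.choice(n, size=k, replace=False, p=probs)
    y_sub = label_fn(idx)
    beta = ridge_fit(X[idx], y_sub, lam)
    return idx, beta
```

A call such as `subsample_ridge(X, lambda i: y[i], np.full(n, 1.0 / n), k, 1.0, np.random.default_rng(0))` would give a uniform-sampling baseline; any non-uniform scheme only changes the `probs` argument.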