Yaqiong Yao, Department of Biostatistics, Columbia University

  • Abstract: A prevailing way to reduce the computational cost of analyzing large data sets is to perform the analysis on a subsample of the full data. Optimal subsampling algorithms use nonuniform subsampling probabilities, derived by minimizing the asymptotic mean squared error of the subsample estimator, to achieve higher estimation efficiency for a given subsample size. The optimal subsampling probabilities for softmax regression have been studied under the baseline constraint, which treats one dimension of the multivariate response differently from the others. Here, we construct optimal subsampling probabilities under the summation constraint, where all dimensions are treated equally. For parameter estimation, these two model constraints give the same mean responses and differ only in the interpretation of the parameters, so they always produce the same conclusions. For selecting subsamples, however, we show that they lead to different optimal subsampling probabilities and thus produce different results. The summation constraint corresponds to a better subsampling strategy. Furthermore, we derive the asymptotic distribution of the mean squared prediction error and minimize its asymptotic mean to define optimal subsampling probabilities that are invariant to the model constraint. Simulations and a real data example are provided to show the effectiveness of the proposed optimal subsampling probabilities.
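
The claim that the two constraints give the same mean responses can be checked numerically. The sketch below is only an illustration under stated assumptions, not the speaker's code: the data are simulated, the names `softmax_probs`, `B_baseline`, and `B_sum` are hypothetical, and the first category is arbitrarily taken as the baseline. It relies on the standard fact that softmax probabilities do not change when the same vector is subtracted from every category's coefficient vector.

```python
import numpy as np

def softmax_probs(X, B):
    """Row-wise softmax of the linear predictors X @ B (one column of B per category)."""
    eta = X @ B
    eta = eta - eta.max(axis=1, keepdims=True)  # subtract row maxima for numerical stability
    expd = np.exp(eta)
    return expd / expd.sum(axis=1, keepdims=True)

# Simulated covariates and an unconstrained coefficient matrix (illustrative values only).
rng = np.random.default_rng(0)
n, d, K = 5, 3, 4
X = rng.normal(size=(n, d))
B = rng.normal(size=(d, K))

# Baseline constraint (assumed here to use the first category): its coefficients are zero.
B_baseline = B - B[:, [0]]

# Summation constraint: for each covariate, the coefficients sum to zero across categories.
B_sum = B - B.mean(axis=1, keepdims=True)

# Both parameterizations shift every category by the same vector, so the
# fitted category probabilities (the mean responses) coincide.
p_baseline = softmax_probs(X, B_baseline)
p_sum = softmax_probs(X, B_sum)
same_mean_response = np.allclose(p_baseline, p_sum)
```

The check covers only the mean responses. It does not reproduce the optimal subsampling probabilities themselves, which the abstract says differ between the two constraints.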