Theory and algorithms for linear mixed models

The problem of variance estimation in high-dimensional regression is motivated by genome-wide association studies (GWAS). Although GWAS have identified many disease-associated single nucleotide polymorphisms (SNPs) at the genome-wide significance level, these SNPs explain only a small fraction of the phenotypic variance, a phenomenon known as "missing heritability" (e.g., see our book chapter). Instead of using only the significant SNPs, an approach based on linear mixed models (LMMs) was proposed to estimate the phenotypic variance explained by all genotyped SNPs. In statistical terms, this can be cast as variance estimation in high-dimensional regression, where the response vector contains the phenotypic values and the design matrix contains the genotyped SNP data.
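
One common way to write this setup is sketched below. The notation is assumed for illustration and is not taken from the text above: n samples, p SNPs, a column-standardized genotype matrix Z, effect vector \beta, and noise \varepsilon. The working LMM treats all p effects as i.i.d. random effects, and the quantity of interest is the proportion of variance explained, h^2.

```latex
\begin{align*}
  y &= Z\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_e^2 I_n), \\
  \beta_j &\overset{\text{iid}}{\sim} N\!\left(0, \sigma_\beta^2 / p\right), \qquad j = 1, \dots, p, \\
  \operatorname{Var}(y) &= \sigma_\beta^2 \, \frac{Z Z^\top}{p} + \sigma_e^2 I_n, \qquad
  h^2 = \frac{\sigma_\beta^2}{\sigma_\beta^2 + \sigma_e^2}.
\end{align*}
```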

Note that only a fraction of the variables may be associated with the response. Direct application of an LMM to estimate the variance explained by all variables may therefore be questionable: many noise variables are included, so the LMM is misspecified. Surprisingly, we observed that the restricted maximum likelihood (REML) estimator under the misspecified LMM works very well regardless of whether the underlying true model is sparse or dense. The LMM can therefore have a substantial advantage over its competitors for variance estimation in high-dimensional regression (e.g., refitted cross-validation and the scaled Lasso only work well in sparse cases). To explore the theoretical underpinning, we collaborated with Prof. Jiming Jiang and Prof. Debashis Paul at UC Davis and showed that the REML estimator in the misspecified LMM is still consistent under some regularity conditions.
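
The behavior described above can be illustrated with a small simulation. The following is a minimal sketch, not the implementation in the VCM package or in the theoretical work: it assumes an intercept-only fixed effect, a working model in which every variable has an i.i.d. N(0, s2_b / p) effect, and a simple grid search over the variance ratio in place of a proper optimizer. The function names, sample sizes, and the number of causal variables are all hypothetical choices made for the example.

```python
import numpy as np


def simulate_sparse(n=800, p=1000, n_causal=50, h2=0.5, seed=0):
    """Simulate y = 1 + Z beta + e where only n_causal of the p effects are nonzero."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)           # standardize columns
    beta = np.zeros(p)
    causal = rng.choice(p, size=n_causal, replace=False)
    # Equal-magnitude effects with random signs, so sum(beta**2) == h2.
    beta[causal] = rng.choice([-1.0, 1.0], size=n_causal) * np.sqrt(h2 / n_causal)
    y = 1.0 + Z @ beta + rng.normal(scale=np.sqrt(1.0 - h2), size=n)
    return y, Z


def reml_variance_components(y, Z, grid_size=999):
    """Grid-search REML under the (possibly misspecified) working LMM
    Var(y) = s2_b * Z Z^T / p + s2_e * I with an intercept. Returns (s2_b, s2_e)."""
    n, p = Z.shape
    K = Z @ Z.T / p
    # REML equals ML on contrasts orthogonal to the fixed effects (here, the intercept).
    Q, _ = np.linalg.qr(np.ones((n, 1)), mode="complete")
    A = Q[:, 1:]                                       # n x (n-1), orthogonal to the ones vector
    d, U = np.linalg.eigh(A.T @ K @ A)
    w = U.T @ (A.T @ y)                                # independent, Var(w_i) = s2_b*d_i + s2_e
    m = n - 1
    # Reparametrize: s2 = s2_b + s2_e, h = s2_b / s2, so Var(w_i) = s2 * (h*d_i + 1 - h).
    hs = np.linspace(0.001, 0.999, grid_size)
    V = hs[:, None] * d[None, :] + (1.0 - hs)[:, None]
    s2 = (w ** 2 / V).mean(axis=1)                     # profile estimate of s2 for each h
    nll = m * np.log(s2) + np.log(V).sum(axis=1)       # -2 * profiled restricted log-likelihood
    k = int(np.argmin(nll))
    return hs[k] * s2[k], (1.0 - hs[k]) * s2[k]


y, Z = simulate_sparse()
s2_b, s2_e = reml_variance_components(y, Z)
print("estimated proportion of variance explained:", s2_b / (s2_b + s2_e))
```

In this sketch only 50 of the 1000 simulated variables carry signal, yet the working model assigns a random effect to every variable. Changing n_causal toward p gives the dense case for comparison.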

More relevant information:

Software: VCM

  • R package VCM: three approaches for fitting variance components models: the PX-EM algorithm, the MM algorithm, and the method of moments. 2019.