Research

I am eager to seize the unprecedented opportunities of the big data era to develop statistically rigorous and computationally efficient machine learning methods for the analysis of large-scale, complex biomedical data, and ultimately to provide integrated evidence by leveraging rich high-dimensional multi-omics data, electronic health records and biobank data for broader translational impact. Currently, my core methodological research focuses on: (i) statistical estimation and inference for high-dimensional models; (ii) transfer learning and data fusion; and (iii) integration of multi-omics data. I am also highly motivated to collaborate closely with other statisticians and biomedical scientists to solve emerging problems in modern data science and population health sciences.

Research Interests

Technological advances have made it possible to collect an increasingly large amount of information from biomedical studies, for instance in genetics, genomics, proteomics and brain imaging. In particular, the high dimensionality of the collected data, where the number of variables is comparable to or even larger than the sample size, challenges conventional parameter estimation and inference. I am interested in developing statistical methods for the analysis of such large, high-dimensional and complex data that deliver more accurate estimation and more reliable uncertainty quantification and prediction. My research primarily focuses on statistical models beyond linear regression (e.g. generalized linear models, Cox proportional hazards models, and estimating equations), which pose additional difficulties in computation and theoretical development.

Following the central dogma of molecular biology, the omics cascade provides a unique opportunity for a comprehensive understanding of human diseases. With the advent of modern high-throughput techniques, biological data have become increasingly available at the multiple levels of the genome, transcriptome, proteome and metabolome. I am interested in developing statistical methods for integrative analysis of multi-omics data, where leveraging the interplay between molecules (e.g. through biological pathways and regulatory networks) is expected to increase statistical power for detecting disturbed biological processes. Through various collaborations, I also apply state-of-the-art systems biology and statistical machine learning methods to omics data to improve the systematic understanding, diagnosis and prognosis of infectious and chronic diseases.

Drawing on hands-on experience with large-scale healthcare databases, I have developed an interest in exploring the rich sources of electronic health records and claims data, potentially together with other well-designed and curated studies of smaller scale, for translational medicine research. Monitoring outcomes of health care providers (e.g. patient survival, hospitalization and readmission) is important for evaluating the quality of health care and designing potential quality improvement programs. I am interested in developing statistical methods that leverage large claims and electronic health records databases to support fair assessment of the performance of health care providers. My work is broadly applicable to assessing providers' performance on patient mortality, hospitalization and readmission outcomes, which can help patients make informed decisions and aid overseers and payers in identifying underperforming providers.