My research is largely motivated by my collaborative work. I am currently most enthusiastic about applying innovative statistical methods and theory to cancer and infectious disease research. I am also passionate about promoting best statistical practices in all areas. Selected research topics are described below.
Semiparametric Models and Theory: I work extensively with survey data in my daily collaborations, including national health databases and local data consortia. Surveys are used when the target population is large and data collection limitations and budget constraints allow only a limited sample to be observed, so semiparametric models are leveraged to make inference about the original population from the smaller survey sample. This type of "sampling thinking" can also be applied to missing data and causal inference problems (see this paper about frequency and probability weights for details). My goal is to apply semiparametric models to survey and other large-scale observational studies and to make robust inference by drawing on semiparametric efficiency theory. There are many excellent educational resources on this topic; I highly recommend Semiparametric Theory and Missing Data by Dr. Anastasios A. Tsiatis and Dr. Edward H. Kennedy's notes.
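As a minimal sketch of the "sampling thinking" described above, the snippet below weights each sampled unit by the inverse of its inclusion probability (a Hajek-type weighted mean) and compares the result with the unweighted sample mean. The population, the two strata, and the inclusion probabilities are all hypothetical values chosen for illustration; they are not taken from any study mentioned here.

```python
import random


def hajek_mean(values, inclusion_probs):
    """Weighted mean with weights 1 / inclusion probability (Hajek-type estimator)."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def simulate_survey(seed=0, population_size=20000):
    """Draw an unequal-probability sample from a hypothetical two-stratum population.

    Returns (true population mean, unweighted sample mean, weighted sample mean).
    """
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        # Assumption: about 20% of the population is in stratum 1, which has a higher mean.
        in_stratum_1 = rng.random() < 0.2
        y = rng.gauss(10.0 if in_stratum_1 else 2.0, 1.0)
        # Assumption: stratum 1 is heavily oversampled.
        p = 0.5 if in_stratum_1 else 0.05
        population.append((y, p))

    true_mean = sum(y for y, _ in population) / population_size

    # Each unit enters the sample independently with its own inclusion probability.
    sample = [(y, p) for y, p in population if rng.random() < p]
    ys = [y for y, _ in sample]
    ps = [p for _, p in sample]

    unweighted = sum(ys) / len(ys)
    weighted = hajek_mean(ys, ps)
    return true_mean, unweighted, weighted


if __name__ == "__main__":
    true_mean, unweighted, weighted = simulate_survey()
    print(f"true mean:       {true_mean:.3f}")
    print(f"unweighted mean: {unweighted:.3f}")
    print(f"weighted mean:   {weighted:.3f}")
```

In this toy setup the unweighted mean is pulled toward the oversampled stratum, while the weighted mean stays close to the population mean. The same reweighting logic underlies inverse probability weighting for missing data and causal inference, where the "inclusion probability" is replaced by the probability of being observed or treated.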
A New Class of Semiparametric Models for Between-subject Attributes: We developed a new class of semiparametric functional response models (FRM) to model both classic within-subject attributes and between-subject attributes in regression analysis. FRM can be particularly useful when the response variable in a regression model is a function of the attributes of both members of a pair, that is, a between-subject attribute, because deriving the asymptotic properties of estimators for such regression models is challenging. We have been using FRM to model Mann-Whitney-Wilcoxon type outcomes when the data are highly skewed, and to model network data, since the edges of a network are intrinsically endogenous between-subject attributes.
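To make the idea of a between-subject attribute concrete, here is a small sketch of the Mann-Whitney-Wilcoxon quantity as an average of a pairwise response. The response is defined on a pair of subjects (one from each group) rather than on a single subject. This is only the classical two-sample U-statistic, not the FRM estimator itself; the function names and the data are made up for illustration.

```python
from itertools import product


def pairwise_response(a, b):
    """Between-subject response for one pair: 1 if a < b, 0.5 if tied, 0 otherwise."""
    if a < b:
        return 1.0
    if a == b:
        return 0.5
    return 0.0


def mww_probability(group_a, group_b):
    """Estimate P(A < B) + 0.5 * P(A = B) by averaging over all cross-group pairs."""
    pairs = list(product(group_a, group_b))
    return sum(pairwise_response(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical, highly skewed measurements in two groups.
    group_a = [1.2, 0.4, 35.0, 2.1]
    group_b = [3.3, 0.9, 120.0, 5.6, 2.1]
    print(mww_probability(group_a, group_b))
```

Because the pairwise response depends only on the ordering of the two values, extreme observations such as 35.0 or 120.0 have no more influence than any other value, which is why this type of outcome suits highly skewed data. The pairs share subjects and are therefore correlated, which is the source of the inferential difficulty mentioned above.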
Machine Learning Modeling and Inference: Machine learning methods have been widely adopted in disease prediction, such as cancer diagnosis and prognosis and COVID-19 infection prediction. Recent developments in the semiparametric literature have ignited our passion for statistical inference with machine learning methods. Specifically, our interest lies in modeling the nuisance functions with machine learning models and making correct inference about the parameters of interest by partialling out the nuisance. I am also working on building novel machine learning models and their inferential frameworks for cancer prediction and early prevention, and for between-subject attributes such as network data.
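The partialling-out idea can be sketched as follows, under a partially linear model y = theta * d + g(x) + noise that is assumed here purely for illustration. Both y and d are regressed on x with a flexible learner, and theta is estimated from the two sets of residuals, with cross-fitting so that each residual comes from a model fit on other observations. A simple binned-mean regression stands in for the machine learning model; the simulated data, function names, and tuning values are all hypothetical.

```python
import math
import random


def fit_binned_mean(xs, ys, n_bins=20):
    """Stand-in nuisance learner: mean of y within equal-width bins of x in [0, 1]."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall training mean.
    means = [s / c if c > 0 else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def partial_out(xs, ds, ys, n_folds=2, seed=0):
    """Cross-fitted residual-on-residual estimate of theta."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    num = den = 0.0
    for held_out in folds:
        held_set = set(held_out)
        train = [i for i in idx if i not in held_set]
        # Nuisance fits use only the training folds.
        y_hat = fit_binned_mean([xs[i] for i in train], [ys[i] for i in train])
        d_hat = fit_binned_mean([xs[i] for i in train], [ds[i] for i in train])
        for i in held_out:
            d_res = ds[i] - d_hat(xs[i])
            y_res = ys[i] - y_hat(xs[i])
            num += d_res * y_res
            den += d_res * d_res
    return num / den


def naive_slope(ds, ys):
    """Slope from regressing y on d alone, ignoring x."""
    d_bar = sum(ds) / len(ds)
    y_bar = sum(ys) / len(ys)
    num = sum((d - d_bar) * (y - y_bar) for d, y in zip(ds, ys))
    den = sum((d - d_bar) ** 2 for d in ds)
    return num / den


def simulate(n=4000, theta=1.5, seed=1):
    """Hypothetical data in which x confounds the relationship between d and y."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ds = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    ys = [theta * d + 2.0 * math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 1.0)
          for x, d in zip(xs, ds)]
    return xs, ds, ys


if __name__ == "__main__":
    xs, ds, ys = simulate()
    print(f"naive slope:         {naive_slope(ds, ys):.3f}")
    print(f"partialled-out est.: {partial_out(xs, ds, ys):.3f}")
```

In this simulation the naive slope is biased because x drives both d and y, while the partialled-out estimate should land near the true value of 1.5. In practice the binned-mean learner would be replaced by a random forest, boosting, or another flexible model, and a standard error would accompany the point estimate.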
Featured upcoming talks and lectures (all are welcome to join):
Aug 2-7, 2025 -- "Robust Causal Estimation using Random Forests" Topic contributed session (session organizer), JSM 2025, Nashville, TN
Aug 2-7, 2025 -- "Unlocking the Power of Semiparametric Models: A Practical Tutorial for Analyzing Complex Data with Minimum Assumptions" Professional Development Course/CE, JSM 2025, Nashville, TN
Aug 2-7, 2025 -- "Common Misunderstandings of Weights in Survey Studies and Beyond" Roundtable discussion, JSM 2025, Nashville, TN
Aug 2-7, 2025 -- "Recent Advances in the Use of Sequence Data in Infectious Disease Tracking" Invited paper session (session chair), JSM 2025, Nashville, TN