With advances in data collection and storage technologies, data analysis in modern scientific research and practice has shifted from analyzing a single dataset to combining several datasets. We focus on how to make use of external information, which generally provides only partial information. Possible external sources include historical data, data from personal devices, and population censuses.
Prediction: We consider nonparametric kernel regression for an internal dataset, using constraints that encode auxiliary information from an external dataset available as either summary statistics or individual-level data. The proposed estimator is asymptotically normal and improves on standard kernel regression without external information in terms of asymptotic mean integrated squared error. It remains robust even if the distribution of the external data differs from that of the primary data.
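For orientation, the standard kernel regression that serves as the baseline above can be sketched as a Nadaraya-Watson estimator. This is a minimal illustration only: the Gaussian kernel, the function names, and the toy data are assumptions, and the external-information constraints of the proposed method are not implemented here.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel-weighted average of ys, with weights centred at the query point x0."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; try a larger bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy internal dataset (hypothetical): a noise-free line on an evenly spaced grid.
xs = [i / 10 for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]
estimate = nadaraya_watson(0.5, xs, ys, bandwidth=0.1)
```

Because the design points are symmetric around 0.5 and the response is linear, the estimate at 0.5 coincides with the true value 2.0 up to floating-point error. A constrained variant would adjust these weights or the fitted values so that the estimate agrees with the external summary information.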
Causal Inference: The randomized controlled trial has long been the gold standard for causal inference. We consider the setting in which observational external control data, possibly with a much larger sample size, are available. Ideally, these data would allow better estimation of the control effect. However, under model misspecification, simply combining the two datasets can lose efficiency. We construct a safe and efficient algorithm and show that external control data remain helpful even under model misspecification.
Chebyshev Greedy Algorithms: For high-dimensional covariates, we consider a greedy algorithm as an alternative to the popular L1-regularization approach. Inspired by gradient descent, the algorithm selects at each step the variable whose gradient component is largest in magnitude. The proposed algorithm applies to any convex loss, for example least squares and generalized linear models. We provide a convergence rate for the algorithm, which can attain the minimax rate.
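The selection rule can be illustrated with a small sketch in the spirit of a Chebyshev greedy algorithm, specialized to the least-squares loss. Everything here is an assumption made for illustration: the function names, the refit by cyclic coordinate minimization over the selected variables, the fixed number of steps, and the toy design. It is not presented as the exact procedure analyzed in this work.

```python
def partial_gradient(X, y, beta, j):
    """j-th component of the gradient of (1/2n) * sum_i (x_i . beta - y_i)^2."""
    total = 0.0
    for row, yi in zip(X, y):
        residual = sum(x * b for x, b in zip(row, beta)) - yi
        total += residual * row[j]
    return total / len(y)


def greedy_least_squares(X, y, n_steps, refit_sweeps=50):
    """Select one variable per step by largest absolute gradient, then refit."""
    n, p = len(X), len(X[0])
    # Mean squared column values: the curvature of the loss along each coordinate.
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    beta = [0.0] * p
    active = []
    for _ in range(n_steps):
        candidates = [j for j in range(p) if j not in active and col_sq[j] > 0.0]
        if not candidates:
            break
        # Greedy step: pick the unselected variable with the largest |gradient|.
        j_star = max(candidates, key=lambda j: abs(partial_gradient(X, y, beta, j)))
        active.append(j_star)
        # Refit over the selected variables by exact coordinate-wise minimization.
        for _ in range(refit_sweeps):
            for j in active:
                beta[j] -= partial_gradient(X, y, beta, j) / col_sq[j]
    return beta, active


# Toy design (hypothetical): four mutually orthogonal +/-1 columns, two of them active.
c0 = [1, 1, 1, 1, -1, -1, -1, -1]
c1 = [1, 1, -1, -1, 1, 1, -1, -1]
c2 = [1, -1, 1, -1, 1, -1, 1, -1]
c3 = [1, -1, -1, 1, 1, -1, -1, 1]
X = [list(row) for row in zip(c0, c1, c2, c3)]
y = [3.0 * a - 2.0 * c for a, c in zip(c0, c2)]
beta_hat, selected = greedy_least_squares(X, y, n_steps=2)
```

On this noise-free toy problem the two steps pick columns 0 and 2 and recover the coefficients 3 and -2. For another convex loss, only the gradient and the refit step would change; the selection rule stays the same.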
Log-binomial Models: There is high demand for binary data modeling in health technology assessment (HTA) applications. Specifically, BARDS-HTA Statistics summarize binary data through risk ratios in compliance with guidance documents from the HTA agencies. For these analyses, log-binomial regression is more suitable than logistic regression, because its exponentiated coefficients are risk ratios rather than odds ratios. However, log-binomial regression is well known to have several limitations, including a high non-convergence rate, particularly in the presence of multiple covariates or extreme event counts. The non-convergence issue can be addressed by adding constraints to the log-binomial regression.
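To make the role of constraints concrete: a log-binomial model sets P(Y = 1 | x) = exp(x'beta), so every linear predictor must stay below zero for the fitted probabilities to remain below one. The text above does not specify which constraints or solver are used, so the following is only a sketch of one simple way to keep the iterates feasible, namely gradient ascent with step halving. The function names, the tolerance values, the starting point, and the toy data are all assumptions.

```python
import math

MARGIN = 1e-12  # assumed safety margin keeping fitted probabilities strictly below 1


def mean_loglik(X, y, beta):
    """Mean log-binomial log-likelihood; -inf when a linear predictor is not negative."""
    total = 0.0
    for row, yi in zip(X, y):
        eta = sum(x * b for x, b in zip(row, beta))
        if eta >= -MARGIN:
            return -math.inf  # infeasible: fitted probability would reach 1
        total += eta if yi == 1 else math.log1p(-math.exp(eta))
    return total / len(y)


def mean_gradient(X, y, beta):
    """Gradient of the mean log-likelihood: average of x_i * (y_i - p_i) / (1 - p_i)."""
    grad = [0.0] * len(beta)
    for row, yi in zip(X, y):
        p = math.exp(sum(x * b for x, b in zip(row, beta)))
        w = (yi - p) / (1.0 - p)
        for j, x in enumerate(row):
            grad[j] += w * x
    return [g / len(y) for g in grad]


def fit_log_binomial(X, y, max_iter=5000, tol=1e-16):
    """Gradient ascent with step halving; assumes column 0 of X is the intercept."""
    ybar = sum(y) / len(y)  # assumes 0 < ybar < 1, so the start is feasible
    beta = [math.log(ybar)] + [0.0] * (len(X[0]) - 1)
    current = mean_loglik(X, y, beta)
    for _ in range(max_iter):
        grad = mean_gradient(X, y, beta)
        sq_norm = sum(g * g for g in grad)
        if sq_norm < tol:
            break
        step = 1.0
        while step > 1e-12:
            candidate = [b + step * g for b, g in zip(beta, grad)]
            value = mean_loglik(X, y, candidate)
            # Accept only feasible steps that increase the likelihood enough.
            if value >= current + 1e-4 * step * sq_norm:
                beta, current = candidate, value
                break
            step /= 2.0
        else:
            break  # no acceptable step found; return the last feasible iterate
    return beta


# Toy data (hypothetical): 10/40 events among unexposed, 20/40 among exposed.
X = [[1.0, 0.0]] * 40 + [[1.0, 1.0]] * 40
y = [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20
beta_hat = fit_log_binomial(X, y)
risk_ratio = math.exp(beta_hat[1])
```

In this toy example the fitted risk ratio approaches 0.5 / 0.25 = 2, the ratio of the two observed risks. The feasibility check inside `mean_loglik` is what keeps each iterate inside the parameter space; an unconstrained Newton-type update can leave that region, which is one common source of non-convergence.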