Research

My research centers on several areas of data science, including:

- High-Dimensional Statistical Learning;
- Large Scale Statistical Computation;
- Nonparametric and Semiparametric Modeling;
- Time Series and Quantitative Finance;
- Survival Analysis and Public Health;
- Artificial Intelligence.

I love to learn about interesting problems from various disciplines! I am particularly fascinated by effective strategies, fast algorithms, and beautiful theories that are inspired by, and can inform, practical applications.

Inference for High Dimensional Data

My research on high dimensional data contributes to a deeper understanding of how high dimensionality affects statistical inference.

In joint work with J. Fan and N. Hao, 'Variance estimation using refitted cross-validation in ultra-high dimensional regression' (J. Royal. Statist. Soc. Ser. B., 2012), we addressed the challenges that high dimensionality poses for variance estimation and proposed a new approach to estimating the variance in ultrahigh dimensional linear models. It turns out that some unimportant variables are often selected and have high empirical correlations with the noise, even though they are independent of it. This phenomenon, called spurious correlation, is inherent in ultrahigh dimensional problems. Based on this finding, we illustrated that the naive two-stage approach performs badly even in the simplest case. Moreover, we showed that its bias can be of order $\hat{s} \log{p}/n$, where $\hat{s}$ is the selected model size, $p$ is the number of predictors, and $n$ is the sample size. The bias of the naive two-stage estimator is therefore often non-negligible and grows as $\hat{s}$ and $p$ increase. We also illustrated that the plug-in estimator based on the Lasso has a large bias of order $\hat{s} \log{p}/n$.
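The spurious correlation effect is easy to reproduce numerically. The following is a minimal sketch, not code from the paper: it draws a noise vector and predictors that are independent of it, then reports the largest absolute sample correlation between the noise and the first $p$ predictors. The function name and the Gaussian design are assumptions made for illustration.

```python
import numpy as np


def max_spurious_correlation(n, p_values, seed=0):
    """Largest |sample correlation| between noise and independent predictors.

    Hypothetical helper for illustration: the noise and all predictors are
    i.i.d. standard normal, so every true correlation is exactly zero.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)                    # noise vector
    X = rng.standard_normal((n, max(p_values)))     # predictors, independent of eps

    # Standardize so that the inner product divided by n is the sample correlation.
    eps_std = (eps - eps.mean()) / eps.std()
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    abs_corr = np.abs(X_std.T @ eps_std) / n

    # Use nested predictor sets: the first p columns for each requested p.
    return {p: float(abs_corr[:p].max()) for p in p_values}


if __name__ == "__main__":
    for p, r in max_spurious_correlation(n=50, p_values=[10, 100, 5000]).items():
        print(f"p = {p:5d}: max |corr| = {r:.3f}")
```

Because the predictor sets are nested, the reported maximum can only grow with $p$, even though every predictor is independent of the noise.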

We proposed a refitted cross-validation (RCV) technique for variance estimation. The RCV approach is fundamentally different from both the traditional two-stage approach and classical cross-validation. In our approach, the model selection and refitting stages are carried out on two different subsets of the data. This is key to reducing the modeling bias caused by spurious correlation. Moreover, we proved that the proposed estimator $\hat{\sigma}^2_{RCV}$ is unbiased as long as all important variables are selected in the first stage, and that it is asymptotically normal under regularity conditions. This shows that the RCV variance estimator $\hat{\sigma}^2_{RCV}$ has an oracle property.
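The split-select-refit idea can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: the selection step is plain marginal correlation screening that keeps the top `k` predictors, the model has no intercept (data are assumed centered), the two refitted estimates are averaged, and all function names are hypothetical. A naive two-stage estimator, which selects and refits on the same data, is included for comparison.

```python
import numpy as np


def _screen_top_k(X, y, k):
    """Indices of the k predictors with the largest |marginal correlation| with y.

    Assumed selection rule for illustration; any variable selector could be used.
    """
    score = np.abs(X.T @ y) / np.linalg.norm(X, axis=0)
    return np.sort(np.argsort(score)[-k:])


def _refit_variance(X, y, idx):
    """Least-squares refit on the columns in idx; returns RSS / (n - |idx|)."""
    Z = X[:, idx]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid / (len(y) - len(idx)))


def naive_two_stage_variance(X, y, k):
    """Select and refit on the SAME data (prone to spurious-correlation bias)."""
    return _refit_variance(X, y, _screen_top_k(X, y, k))


def rcv_variance(X, y, k, seed=0):
    """Refitted cross-validation: select on one half, refit on the other half."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    half1, half2 = perm[: len(y) // 2], perm[len(y) // 2:]

    model1 = _screen_top_k(X[half1], y[half1], k)   # selected on half 1
    model2 = _screen_top_k(X[half2], y[half2], k)   # selected on half 2

    var1 = _refit_variance(X[half2], y[half2], model1)  # refitted on half 2
    var2 = _refit_variance(X[half1], y[half1], model2)  # refitted on half 1
    return 0.5 * (var1 + var2)


if __name__ == "__main__":
    # Null model: no predictor matters and the true noise variance is 1.
    rng = np.random.default_rng(42)
    X = rng.standard_normal((200, 1000))
    y = rng.standard_normal(200)
    print("naive two-stage:", naive_two_stage_variance(X, y, k=5))
    print("RCV            :", rcv_variance(X, y, k=5))
```

In the null model used in the demo, the variables chosen on one half are independent of the noise in the other half, so refitting there does not absorb noise the way refitting on the selection data does.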

In joint work with J. Box and W. Zhang, 'A dynamic structure for high dimensional covariance matrices and its application in portfolio allocation' (J. Amer. Statist. Assoc., 2016), we considered high dimensional covariance matrices and introduced a dynamic structure in which the covariance matrix depends on some low dimensional predictors. This dynamic structure is motivated by portfolio allocation in finance. We showed that portfolio allocation based on our dynamic approach outperforms allocation based on competing approaches, such as the factor model and linear or nonlinear shrinkage estimators.
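To make the link between a covariance estimate and portfolio allocation concrete, here is a minimal sketch of one standard allocation rule, the global minimum-variance portfolio $w = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^\top \Sigma^{-1} \mathbf{1})$. This rule is chosen purely for illustration: the description above does not say which allocation rule the paper uses, and the dynamic covariance estimator itself is not implemented here. The function name is hypothetical.

```python
import numpy as np


def min_variance_weights(cov):
    """Global minimum-variance weights for a given covariance matrix.

    Solves cov @ x = 1 and rescales x so that the weights sum to one.
    Assumes cov is symmetric positive definite; short positions are allowed.
    """
    cov = np.asarray(cov, dtype=float)
    ones = np.ones(cov.shape[0])
    x = np.linalg.solve(cov, ones)   # proportional to inverse(cov) @ 1
    return x / x.sum()


if __name__ == "__main__":
    cov = np.array([[1.0, 0.0],
                    [0.0, 4.0]])     # two uncorrelated assets
    w = min_variance_weights(cov)
    print("weights:", w)
    print("portfolio variance:", w @ cov @ w)
```

Any covariance estimate (factor-based, shrinkage, or dynamic) can be passed in as `cov`, which is why the quality of the estimate feeds directly into the resulting allocation.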

Inference for Complex Survival Data

In survival analysis, I aim to develop tools and methodologies that improve estimation efficiency in semiparametric models for survival data, and to offer insights both for optimality theory and for practitioners in health care.

In joint work with K. Chen, L. Sun and J.L. Wang, 'Global partial likelihood for nonparametric proportional hazards model' (J. Amer. Statist. Assoc., 2010), we dealt with the problem of how to efficiently estimate $\psi(x)$ in nonparametric proportional hazards models. By decomposing the true partial likelihood, we observed that the efficiency of estimating $\psi(x)$ depends on how the information from all subjects who are at risk is used, not only from the subjects whose covariates lie in a neighborhood of the point $x$. From this point of view, hazards-based models are quite different from ordinary nonparametric regression, where local smoothing often yields optimal procedures for estimating nonparametric functions under regularity conditions. Based on this observation, we proposed a global version of the partial likelihood, called the global partial likelihood.
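The role of the risk set can be seen directly in the ordinary log partial likelihood for a hazard of the form $\lambda(t \mid x) = \lambda_0(t) \exp\{\psi(x)\}$. The sketch below evaluates it for given values $\psi(X_i)$; it is not the global partial likelihood estimator from the paper, only the standard criterion that it builds on. It assumes no tied event times, and the function name is hypothetical.

```python
import numpy as np


def log_partial_likelihood(times, events, psi):
    """Log partial likelihood for hazard lambda0(t) * exp(psi(x)).

    times  : observed times (event or censoring), one per subject
    events : 1 if the event was observed, 0 if censored
    psi    : value of psi at each subject's covariate
    Assumes no tied event times (illustrative simplification).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    psi = np.asarray(psi, dtype=float)

    # at_risk[i, j] is True when subject j is still at risk at time times[i].
    at_risk = times[None, :] >= times[:, None]

    # Each event contributes psi_i minus the log of a sum over ALL subjects
    # at risk, whatever their covariate values are.
    log_risk_sum = np.log((at_risk * np.exp(psi)[None, :]).sum(axis=1))
    return float(np.sum((psi - log_risk_sum)[events]))


if __name__ == "__main__":
    print(log_partial_likelihood([1, 2, 3], [1, 1, 1], [0.0, 0.0, 0.0]))
```

Every event term sums over the whole risk set, so information about $\psi$ at a point $x$ is carried by subjects whose covariates are far from $x$ as well as by those nearby.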

The global partial likelihood estimator has several good properties. First, unlike the local partial likelihood estimator, it estimates $\psi(x)$ directly. Second, we derived its asymptotic properties, including consistency and asymptotic normality. Third, and most importantly, it is efficient, whereas the local partial likelihood estimator is not. Moreover, as shown in the article, the proposed methodology can be readily extended to the partially linear proportional hazards model.

In addition, my co-author and I gave a selective overview of semiparametric models in survival analysis in joint work with D. Zeng, 'An overview for semiparametric models in survival analysis' (J. Statist. Plan. Infer., 2014). In this project, we discussed several survival models, including proportional hazards models, proportional odds models, and transformation models, for both univariate and multivariate survival data.
