Speaker: Chenyin Gao, Department of Statistics, North Carolina State University
Abstract: Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified test-and-pool (TAP) framework for general parameter estimation that combines gold-standard probability (PR) and non-probability (NPR) samples.
We focus on the case where the study variable is observed in both the PR and NPR data, each of which also contains auxiliary variables. Utilizing the probability design, we conduct a pretest procedure to assess the comparability of the NPR data with the PR data and decide whether to leverage the NPR data in a pooled analysis. When the PR and NPR data are comparable, our approach combines both sources for efficient estimation; otherwise, we retain only the PR data. We also characterize the asymptotic distribution of the proposed TAP estimator under local alternatives and provide a data-adaptive procedure to select the critical tuning parameters, targeting the smallest mean squared error of the TAP estimator. Lastly, to deal with the non-regularity of the TAP estimator, we construct a robust confidence interval with good finite-sample coverage.
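A minimal sketch of the pretest-then-pool idea under simplifying assumptions (scalar parameter, known variances, inverse-variance pooling; the function name tap_estimate and the threshold tau are illustrative, and the paper's actual test statistic, pooling rule, and data-adaptive tuning are more general):

```python
import numpy as np

def tap_estimate(theta_pr, var_pr, theta_npr, var_npr, tau):
    """Test-and-pool sketch: compare the PR and NPR estimates of the same
    parameter; pool them (inverse-variance weighting here, for illustration)
    only if the discrepancy test does not reject at threshold tau."""
    # Wald-type statistic for H0: the NPR data are comparable with the PR data
    t_stat = (theta_pr - theta_npr) ** 2 / (var_pr + var_npr)
    if t_stat <= tau:
        # comparable: pool both sources for efficiency
        w = var_npr / (var_pr + var_npr)
        return w * theta_pr + (1 - w) * theta_npr
    # not comparable: retain only the probability sample
    return theta_pr

# toy usage: a noisy PR estimate and a precise, seemingly comparable NPR estimate
print(tap_estimate(theta_pr=1.02, var_pr=0.04, theta_npr=0.98, var_npr=0.005, tau=3.84))
```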
Speaker: Wendy Ye, Real World Analytics, Eli Lilly and Company
Abstract: Real-world evidence is playing an increasing role in health care decisions and gaining interest for regulatory use. However, the breadth and heterogeneity of real-world data pose statistical challenges in generating real-world evidence. In this talk, I will present the implications of real-world evidence in drug development and common statistical methods to address challenges in real-world analytics.
Speaker: Shuxiao Chen, Department of Statistics and Data Science, University of Pennsylvania
Abstract: Randomized controlled trials (RCTs) are the gold standard for evaluating the causal effect of a treatment; however, they often have limited sample sizes and sometimes poor generalizability. On the other hand, non-randomized, observational data derived from large administrative databases have massive sample sizes and better generalizability, but they are prone to unmeasured confounding bias. It is thus of considerable interest to reconcile effect estimates obtained from randomized controlled trials and observational studies investigating the same intervention, potentially harvesting the best from both realms. In this paper, we theoretically characterize the potential efficiency gain of integrating observational data into the RCT-based analysis from a minimax point of view. For estimation, we derive the minimax rate of convergence for the mean squared error, and propose a fully adaptive anchored thresholding estimator that attains the optimal rate up to poly-log factors. For inference, we characterize the minimax rate for the length of confidence intervals and show that adaptation (to unknown confounding bias) is in general impossible. A curious phenomenon thus emerges: for estimation, the efficiency gain from data integration can be achieved without prior knowledge on the magnitude of the confounding bias; for inference, the same task becomes information-theoretically impossible in general.
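A stylized sketch of the anchored-thresholding idea under one plausible reading (scalar estimates and a user-supplied threshold lam; the paper's estimator is fully adaptive and attains the minimax rate up to poly-log factors):

```python
import numpy as np

def anchored_threshold(tau_rct, tau_obs, lam):
    """Anchored-thresholding sketch: treat the RCT-vs-observational
    discrepancy as an estimate of the confounding bias, soft-threshold it,
    and correct the (efficient but possibly biased) observational estimate.
    When the discrepancy is within lam, the pooled estimate equals tau_obs;
    when it is large, the estimate reverts toward the unbiased tau_rct."""
    d = tau_rct - tau_obs  # apparent confounding bias
    return tau_obs + np.sign(d) * max(abs(d) - lam, 0.0)

# toy usage: small discrepancy borrows the observational estimate,
# large discrepancy anchors back at the RCT estimate
print(anchored_threshold(tau_rct=0.50, tau_obs=0.47, lam=0.10))  # 0.47
print(anchored_threshold(tau_rct=0.50, tau_obs=0.10, lam=0.10))  # 0.40
```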
Speaker: Yichi Zhang, Department of Statistics, North Carolina State University
Abstract: Estimation of the conditional average treatment effect (CATE) is a fundamental topic in causal inference, playing a crucial role in optimal treatment allocation, subgroup analysis, and related problems. In this talk, we will first review some recent methodologies for CATE estimation with a binary treatment, including meta-learners such as the S-learner, X-learner, R-learner, and DR-learner. We point out that some recent methods, such as the R-learner, are rooted in minimizing a Neyman-orthogonal loss, which removes the first-order effect of nuisance function estimation and yields a doubly robust estimator. We then generalize the CATE estimation problem to the continuous treatment regime, where we find the CATE is no longer identifiable by minimizing an R-learner-type loss. To resolve this issue, we introduce a generalized R-learner with a B-spline approximation and an L_2 penalty, and show that the proposed estimator identifies the CATE under the continuous regime. Additional theoretical properties are derived, including rate double robustness and the L_2 rate of convergence.
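A minimal sketch of the R-learner's second stage for a binary treatment (a linear CATE basis with a ridge penalty stands in for the talk's B-spline basis and L_2 penalty; the residuals are assumed to come from cross-fitted first-stage nuisance estimates m_hat and e_hat):

```python
import numpy as np

def r_learner_stage2(X, y_resid, a_resid, ridge=1e-3):
    """Minimize the Neyman-orthogonal (R-learner) loss
        sum_i (y_resid_i - (x_i @ beta) * a_resid_i)^2 + ridge * ||beta||^2,
    where y_resid = Y - m_hat(X) and a_resid = A - e_hat(X) are residuals
    from first-stage nuisance estimates. Returns the CATE coefficients."""
    Z = X * a_resid[:, None]  # basis scaled by the treatment residual
    G = Z.T @ Z + ridge * np.eye(X.shape[1])
    return np.linalg.solve(G, Z.T @ y_resid)

# toy usage with a known linear CATE tau(x) = 1 + 2*x and oracle nuisances
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 1))
X = np.column_stack([np.ones(n), x])
a = rng.binomial(1, 0.5, size=n)
y = x[:, 0] + (1 + 2 * x[:, 0]) * a + rng.normal(size=n)
beta = r_learner_stage2(X, y - (x[:, 0] + 0.5 * (1 + 2 * x[:, 0])), a - 0.5)
print(beta)  # approximately [1, 2]
```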
Speaker: Siyi Liu, Department of Statistics, North Carolina State University
Abstract: Longitudinal studies are often subject to missing data. The ICH E9(R1) addendum addresses the importance of defining a treatment effect estimand with the consideration of intercurrent events. Jump-to-reference (J2R) is one classically envisioned control-based scenario for treatment effect evaluation, where participants in the treatment group are assumed, after intercurrent events, to have the same disease progression as participants with identical covariates in the control group. We establish new estimators to assess the average treatment effect based on a proposed potential outcomes framework under J2R. Various identification formulas are constructed under the assumptions addressed by J2R, motivating estimators that rely on different parts of the observed data distribution. Moreover, we obtain a novel estimator inspired by the efficient influence function, with multiple robustness in the sense that it achieves n^{1/2}-consistency if any pair of the multiple nuisance functions is correctly specified, or if the nuisance functions converge at a rate not slower than n^{-1/4} when flexible modeling approaches are used. The finite-sample performance of the proposed estimators is validated in simulation studies and an antidepressant clinical trial.
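An illustrative sketch of the J2R assumption itself (a single follow-up outcome and a linear control-arm outcome model; the function name j2r_treatment_mean is hypothetical, and the talk's multiply robust estimator built on the efficient influence function is considerably more general):

```python
import numpy as np

def j2r_treatment_mean(X_trt, y_trt, observed_trt, X_ctl, y_ctl):
    """Jump-to-reference sketch: treatment-arm participants with a missing
    outcome are assigned the control-arm outcome-regression prediction at
    their covariates, encoding the J2R assumption that, after an intercurrent
    event, they follow the control group's disease progression."""
    # fit a linear outcome model on the (fully observed) control arm
    Xc = np.column_stack([np.ones(len(X_ctl)), X_ctl])
    beta = np.linalg.lstsq(Xc, y_ctl, rcond=None)[0]
    Xt = np.column_stack([np.ones(len(X_trt)), X_trt])
    y_imp = np.where(observed_trt, y_trt, Xt @ beta)  # keep observed, impute missing
    return y_imp.mean()

# toy usage: two treated participants, the second with a missing outcome
X_ctl = np.array([[0.0], [1.0], [2.0]]); y_ctl = np.array([1.0, 2.0, 3.0])
X_trt = np.array([[0.0], [2.0]]); y_trt = np.array([2.5, np.nan])
print(j2r_treatment_mean(X_trt, y_trt, np.array([True, False]), X_ctl, y_ctl))
```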
Speaker: Tanchumin Xu, Department of Statistics, North Carolina State University
Abstract: Propensity score matching (PSM) and augmented inverse propensity weighting (AIPW) are widely used in observational studies to estimate causal effects. The two approaches present complementary features. The AIPW estimator is doubly robust and achieves the semiparametric efficiency bound, but it can be unstable when the propensity scores are close to 0 or 1 because it weights by the inverse of the propensity score. PSM, on the other hand, circumvents the instability of propensity score weighting, but it hinges on the correctness of the propensity score model and cannot attain the semiparametric efficiency bound. Moreover, the fixed number of matches, K, renders PSM nonsmooth and thus invalidates nonparametric bootstrap inference.
This article presents novel augmented match weighted (AMW) estimators that combine the advantages of matching and weighting estimators. AMW adheres to the form of AIPW for its double robustness and local efficiency, but to mitigate the instability due to weighting, we replace the inverse propensity score weights with matching weights resulting from PSM with an unfixed K, chosen by cross-validation so that variance estimation via the naive bootstrap is valid. We derive the limiting distribution of the AMW estimators and show that they achieve the semiparametric efficiency bound and enjoy the double robustness property. Furthermore, simulation studies and real data applications show that the AMW estimators are stable under extreme propensity scores and that their variances can be obtained by the naive bootstrap.
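A minimal sketch of matching weights on the control arm (nearest-neighbour matching on a scalar propensity score; the AMW estimator then plugs weights like these into the AIPW form in place of the inverse propensity weights, with K chosen by cross-validation rather than fixed as here):

```python
import numpy as np

def matching_weights(ps, treat, K):
    """Sketch of matching weights for the treated-vs-control contrast:
    each control unit's weight counts how often it appears among the K
    nearest propensity-score neighbours of a treated unit, divided by K.
    Such weights avoid the instability of 1 / (1 - e(X)) when e(X) is
    near 1, at the cost of the nonsmoothness that a fixed K induces."""
    w = np.zeros(len(ps))
    ctl = np.where(treat == 0)[0]
    for i in np.where(treat == 1)[0]:
        nn = ctl[np.argsort(np.abs(ps[ctl] - ps[i]))[:K]]
        w[nn] += 1.0 / K
    return w

# toy usage: controls near a treated unit's propensity score gain weight
ps = np.array([0.2, 0.25, 0.8, 0.3, 0.7]); treat = np.array([1, 0, 0, 0, 1])
print(matching_weights(ps, treat, K=2))  # [0. , 0.5, 0.5, 1. , 0. ]
```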
Speaker: Eunah Cho, Department of Statistics, North Carolina State University
Abstract: In causal analysis, real-world data (RWD) and randomized controlled trials (RCTs) complement each other in obtaining a consistent estimate for a target population. We first focus on confounding control in the RWD. Confounding control is crucial and yet challenging for causal inference based on observational studies. Under the typical unconfoundedness assumption, augmented inverse probability weighting (AIPW) has been popular for estimating the average causal effect (ACE) due to its double robustness. Many recent works recommend selecting all outcome predictors for both confounding control and efficient estimation. We show that the AIPW estimator with variable selection targeted at efficient estimation may lose the desirable double robustness property. Instead, we propose including in the propensity score model any covariate that is a predictor of the treatment, the outcome, or both, which preserves the double robustness of the AIPW estimator. Using this principle, we propose a two-stage procedure with penalization for variable selection followed by AIPW estimation, and show that it retains the desirable double robustness property. We evaluate the finite-sample performance of the AIPW estimator with various variable selection criteria through simulation and an application.
Secondly, we consider the integration of the RCT and RWD. The RCT is often not representative of the target population, which leads to a biased ACE. To correct the selection bias, we introduce the RWD and calibrate the covariate balance between the RCT and RWD. In this process, it is important to use consistent estimates from the RWD. If the RWD are subject to missingness and we use only the complete cases, the ACE from the complete cases can be biased under certain missingness mechanisms. We propose a double calibration weighting estimator that corrects the selection bias of the RCT and handles missing covariates in the RWD. We then propose an augmented calibration weighting estimator that uses an outcome mean model to improve efficiency. Furthermore, we add another augmentation to make the estimator robust against a misspecified missingness model, provided the correct reduced outcome mean model can be estimated from the fully observed samples only. Finally, we compare the bias and efficiency of the three proposed estimators and existing methods.
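For reference, the double robustness claim rests on the standard AIPW form; a minimal numpy sketch (the fitted values e_hat, mu1_hat, and mu0_hat stand in for estimated propensity and outcome regressions, however they are selected and fit):

```python
import numpy as np

def aipw(y, a, e_hat, mu1_hat, mu0_hat):
    """Standard AIPW estimator of the average causal effect: consistent if
    either the propensity model (e_hat) or the outcome models (mu1_hat,
    mu0_hat) are correctly specified, which is the double robustness that
    careless variable selection can destroy."""
    t1 = mu1_hat + a * (y - mu1_hat) / e_hat
    t0 = mu0_hat + (1 - a) * (y - mu0_hat) / (1 - e_hat)
    return np.mean(t1 - t0)
```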
Speaker: Dasom Lee, Department of Statistics, North Carolina State University
Abstract: Randomized controlled trials (RCTs) and real-world data (RWD) are often considered mutually complementary data sources. In this talk, we present various methods of analyzing both data sources under Bayesian and causal inference frameworks. We first focus on RCT and RWD settings where binary-valued patient outcomes are measured asynchronously over time across various dose levels. To account for autocorrelation among such longitudinally observed outcomes and for asynchrony in observed time points, we propose a nonhomogeneous first-order Markov model under a flexible nonparametric Bayesian framework, in which the transition probabilities are modeled using B-spline basis functions after suitable transformations. We then consider an integrative approach when both data sources are available. We use complementary features of RCTs and RWD to estimate the average treatment effect in a target population. We propose a calibration weighting estimator that enforces covariate balance between the RCT and RWD, thereby improving the trial-based estimator's generalizability. Moreover, by exploiting semiparametric efficiency theory, we propose a doubly robust augmented calibration weighting estimator that achieves the efficiency bound derived under the identification assumptions. Lastly, we discuss the extension of this integrative approach to survival outcomes. Specifically, we consider a broad class of estimands that are functionals of treatment-specific survival functions, including differences in survival probabilities and restricted mean survival times, under the guidance of semiparametric theory.
Ref: D. Lee, S. Yang, L. Dong, X. Wang, D. Zeng and J. W. Cai (2021). Improving trial generalizability using observational studies. Biometrics, doi:10.1111/biom.13609.
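A minimal sketch of calibration weighting in its entropy-balancing form (one plausible implementation; the function name calibration_weights is illustrative, and the talk's doubly robust estimator adds augmentation terms from semiparametric efficiency theory on top of weights like these):

```python
import numpy as np
from scipy.optimize import minimize

def calibration_weights(X_rct, X_rwd):
    """Choose weights w_i proportional to exp(lam @ x_i) on the RCT units so
    that the weighted RCT covariate means match the RWD covariate means,
    correcting the trial's selection bias relative to the target population."""
    target = X_rwd.mean(axis=0)
    def dual(lam):
        # convex dual of the entropy-balancing problem; its minimizer
        # equates the weighted RCT means with the target means
        return np.log(np.exp(X_rct @ lam).sum()) - lam @ target
    lam = minimize(dual, np.zeros(X_rct.shape[1]), method="BFGS").x
    w = np.exp(X_rct @ lam)
    return w / w.sum()

# toy usage: the trial under-represents a covariate shift in the target
rng = np.random.default_rng(1)
X_rct = rng.normal(0.0, 1.0, size=(200, 2))
X_rwd = rng.normal(0.5, 1.0, size=(5000, 2))
w = calibration_weights(X_rct, X_rwd)
print(w @ X_rct, X_rwd.mean(axis=0))  # weighted RCT means now match the RWD means
```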