I study causal inference and machine learning to advance pharmacoepidemiology, enable precision medicine, and drive innovation in healthcare.
Heterogeneity of causal effects
Synthetic data generation
Unmeasured confounding
Complex longitudinal data
Natural language processing
Probabilistic bias analysis
Comparative treatment safety and effectiveness
Oncology treatment
Hormone therapy in transgender populations
Precision medicine
Developed a two-stage natural language processing (NLP) pipeline for automated ascertainment of a large de-identified transgender and gender diverse (TGD) cohort, using keyword-containing free text from electronic health records (EHRs) across multiple Kaiser Permanente institutions. The first stage identified transgender identity, and the second determined natal sex to distinguish transfeminine and transmasculine individuals. The pipeline offers an efficient, scalable approach to TGD identification, with transformer-based models and support vector machines achieving the best performance in the first and second stages, respectively. Models were validated using an adaptive stratified design, demonstrating strong performance and potential transportability to external populations.
papers under review | stage 1-repo | stage 2-repo coming soon | stage 1-poster | stage 2-poster
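As a minimal illustration of the second-stage classification step, the sketch below trains a TF-IDF + linear support vector machine on toy EHR-style snippets. All snippets, labels, and hyperparameters here are invented placeholders, not the study's actual features, keywords, or data:

```python
# Hypothetical sketch of a stage-2-style natal-sex classifier:
# TF-IDF features + linear SVM. Toy data only; real EHR notes,
# labels, and tuning are far richer than this.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

notes = [
    "history of ovarian cyst noted in chart",
    "patient underwent hysterectomy in 2015",
    "uterine fibroids documented at prior visit",
    "prostate exam documented at annual visit",
    "history of testicular pain reported",
    "vasectomy performed per surgical record",
]
labels = ["natal female"] * 3 + ["natal male"] * 3

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(notes, labels)
print(model.predict(["prior prostate screening noted"])[0])
```

In the actual pipeline, a held-out validation set with the adaptive stratified design (not training accuracy) would be used to assess performance.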
Developed CausalMix, a synthetic data generator that combines conditional variational autoencoders (VAEs) with Bayesian mixture models to simulate samples with potential outcomes under predefined conditional average treatment effects and unmeasured confounding. Implemented in PyTorch Lightning, the framework was evaluated with SDMetrics and the Wasserstein distance to assess distributional similarity. The project highlights CausalMix’s potential for validating causal inference methods and for bias-aware data augmentation to improve conditional treatment effect estimation.
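The distributional-similarity check can be sketched with SciPy's 1-D Wasserstein distance, compared per feature between real and synthetic marginals. The arrays below are simulated stand-ins, not CausalMix output:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=5000)       # stand-in "real" feature
synthetic = rng.normal(loc=0.1, scale=1.1, size=5000)  # stand-in "synthetic" feature

# 1-D Wasserstein (earth mover's) distance between the empirical marginals;
# values near 0 mean the generator reproduces this feature's distribution well.
d = wasserstein_distance(real, synthetic)
print(f"marginal Wasserstein distance: {d:.3f}")
```

In practice this would be computed for each feature (and complemented by SDMetrics' multivariate checks), since matching marginals alone does not guarantee matching joint distributions.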
Designed a simulation study to improve methods for adjusting for exposure misclassification in epidemiologic research. Compared five approaches to assigning prior distributions to the positive and negative predictive values derived from small validation studies. Results showed that a uniform beta prior substantially improved validity when validation data were sparse, while all methods performed similarly with sufficient data. This work highlights practical strategies for enhancing the validity of bias-adjusted estimates in real-world studies with limited sample sizes.
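The uniform-beta-prior approach can be sketched as a small Monte Carlo bias analysis: draw the predictive values from the Beta posteriors implied by a Beta(1, 1) prior and the validation counts, reclassify the observed 2x2 table, and summarize the distribution of bias-adjusted odds ratios. All counts below are illustrative, not the study's:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative observed 2x2 table (exposure classified with error):
a, b = 120, 80    # cases: classified exposed / classified unexposed
c, d = 380, 420   # noncases: classified exposed / classified unexposed

# Illustrative validation-study counts for the exposure classifier:
tp, fp = 45, 5    # classified exposed: truly exposed / truly unexposed
tn, fn = 47, 3    # classified unexposed: truly unexposed / truly exposed

n_sims = 10_000
# Uniform Beta(1, 1) prior + validation counts -> Beta posteriors for PPV/NPV.
ppv = rng.beta(tp + 1, fp + 1, n_sims)
npv = rng.beta(tn + 1, fn + 1, n_sims)

# Expected true-exposure counts, reclassified within cases and noncases.
A = a * ppv + b * (1 - npv)   # truly exposed cases
B = (a + b) - A               # truly unexposed cases
C = c * ppv + d * (1 - npv)   # truly exposed noncases
D = (c + d) - C               # truly unexposed noncases

or_adj = (A * D) / (B * C)    # bias-adjusted odds ratio draws
lo, med, hi = np.percentile(or_adj, [2.5, 50, 97.5])
print(f"adjusted OR median {med:.2f} (95% simulation interval {lo:.2f}-{hi:.2f})")
```

This sketch applies a single PPV/NPV to cases and noncases; a fuller analysis would allow the predictive values to differ by outcome status and propagate random error as well.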
Conducted a simulation study evaluating two widely used methods for estimating conditional average treatment effects (CATEs): the doubly robust (DR) learner and the causal forest algorithm. Both achieved strong confidence interval coverage overall, but the DR learner outperformed causal forests in scenarios with strong treatment effects and low heterogeneity. The study also revealed that identifying effect modifiers remains challenging, especially in smaller samples. These findings offer practical guidance for method selection when estimating CATEs in empirical research.
paper and repo coming soon
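A minimal DR-learner can be sketched on simulated randomized data with a known propensity score of 0.5. The data-generating process and linear nuisance models below are illustrative assumptions, not the study's actual simulation design:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
x = rng.uniform(0, 1, n)                  # single covariate
t = rng.integers(0, 2, n).astype(float)   # randomized treatment, e(x) = 0.5
tau = 1.0 + x                             # true CATE: tau(x) = 1 + x
y = 2.0 * x + tau * t + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), x])      # design matrix [1, x]

def ols_predict(X_fit, y_fit, X_new):
    """Least-squares fit and prediction (stand-in for the nuisance models)."""
    beta, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
    return X_new @ beta

# Stage 1: outcome regressions mu1, mu0 fit on the treated / control arms.
mu1 = ols_predict(X[t == 1], y[t == 1], X)
mu0 = ols_predict(X[t == 0], y[t == 0], X)
mu_t = np.where(t == 1, mu1, mu0)

# Stage 2: doubly robust pseudo-outcome, then regress it on the covariates.
e = 0.5                                   # known propensity (randomized design)
phi = (t - e) / (e * (1 - e)) * (y - mu_t) + mu1 - mu0
beta_cate, *_ = np.linalg.lstsq(X, phi, rcond=None)
print(f"estimated CATE: {beta_cate[0]:.2f} + {beta_cate[1]:.2f} * x (truth: 1 + x)")
```

For brevity this sketch skips cross-fitting; a proper DR-learner estimates the nuisance models on held-out folds, and in observational settings the propensity score must also be estimated.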