Research

Research Summary

I am broadly interested in both the theoretical and applied aspects of statistical machine learning and data science. My recent and ongoing work includes:

Theory on decision trees and random forests
Software and methodology for interpretable models and variable importance
Estimation of heterogeneous treatment effects and subgroup discovery in causal inference

My overall research goal is to advance methodology for high-stakes decision-making, and to apply these methods to improve healthcare and further social good.

I was trained as a mathematician and have previously worked on probability theory, learning theory, signal processing, and stochastic optimization.

Publications

Abhineet Agarwal*, Yan Shuo Tan*, Omer Ronen, Chandan Singh, Bin Yu, Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods. ICML 2022 (long presentation). [conference paper]
Yan Shuo Tan, Abhineet Agarwal, Bin Yu, A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds. AISTATS 2022. [arxiv]
Yan Shuo Tan, Roman Vershynin, Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval. JMLR 2023. [arxiv]
Chandan Singh, Keyan Nasseri, Yan Shuo Tan, Tiffany Tang, Bin Yu, imodels: a python package for fitting interpretable models. Journal of Open Source Software (2021). [journal paper]
John Lipor, David Hong, Yan Shuo Tan, Laura Balzano, Subspace Clustering using Ensembles of K-Subspaces. Information and Inference (2021). [journal paper]
Raaz Dwivedi*, Yan Shuo Tan*, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu, Stable discovery of interpretable subgroups via calibration in causal studies. International Statistical Review (2020). [journal paper]
Nick Altieri**, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu, Curating a COVID-19 data repository and forecasting county-level death counts in the United States. Harvard Data Science Review (2020). [journal paper]
Yan Shuo Tan, Roman Vershynin, Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods. COLT 2018. [conference paper]
Yan Shuo Tan, Roman Vershynin, Phase Retrieval via Randomized Kaczmarz: Theoretical Guarantees. Information and Inference (2018). [journal paper]
Yan Shuo Tan, Energy Optimization for Distributions on the Sphere and Improvement to the Welch Bounds. Electronic Communications in Probability 22 (2017). [journal paper]

* denotes equal contribution, ** denotes alphabetical order

Preprints

Xin Chen, Jason M. Klusowski, Yan Shuo Tan, Error Reduction from Stacked Regressions. Submitted. [arxiv]
Yan Shuo Tan*, Chandan Singh*, Keyan Nasseri, Abhineet Agarwal, Bin Yu, Fast Interpretable Greedy-Tree Sums (FIGS). Submitted. [arxiv]
Yan Shuo Tan, Sparse Phase Retrieval via Sparse PCA Despite Model Misspecification: A Simplified and Extended Analysis. [arxiv]