Yan Shuo Tan
I am currently an assistant professor in the Department of Statistics and Data Science at the National University of Singapore. I was previously a Neyman Visiting Assistant Professor in UC Berkeley's Statistics Department, where I was fortunate to be advised by Bin Yu. I did my PhD in Mathematics at the University of Michigan, where I was advised by Roman Vershynin and Anna Gilbert.
My current research is in statistical machine learning, focusing on the theory, methodology and applications of modeling with decision trees and tree ensembles.
Highlighted publications
Omer Ronen*, Theo Saarinen*, Yan Shuo Tan*, James Duncan, Bin Yu, A mixing time lower bound for a simplified version of BART. Submitted. [arxiv]
Abhineet Agarwal*, Yan Shuo Tan*, Omer Ronen, Chandan Singh, Bin Yu, Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods. ICML 2022 (long presentation) [conference paper]
Yan Shuo Tan*, Chandan Singh*, Keyan Nasseri, Abhineet Agarwal, Bin Yu, Fast Interpretable Greedy-Tree Sums (FIGS). [arxiv]
Yan Shuo Tan, Abhineet Agarwal, Bin Yu, A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds. AISTATS 2022. [arxiv]
Raaz Dwivedi*, Yan Shuo Tan*, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu, Stable discovery of interpretable subgroups via calibration in causal studies. International Statistical Review (2020). [journal paper]
Yan Shuo Tan, Roman Vershynin, Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval. Accepted by JMLR. [arxiv]
Research Summary
I am broadly interested in both the theoretical and applied aspects of statistical machine learning and data science. My recent and ongoing work includes:
Theory on decision trees and random forests
Software and methodology for interpretable models and variable importance
Estimation of heterogeneous treatment effects and subgroup discovery in causal inference
My overall research goal is to advance methodology for high-stakes decision-making, and to apply these methods to improve healthcare and further social good.
I was trained as a mathematician and have previously worked on probability theory, learning theory, signal processing, and stochastic optimization.
Publications
Omer Ronen*, Theo Saarinen*, Yan Shuo Tan*, James Duncan, Bin Yu, A mixing time lower bound for a simplified version of BART. Submitted. [arxiv]
Abhineet Agarwal*, Yan Shuo Tan*, Omer Ronen, Chandan Singh, Bin Yu, Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods. ICML 2022 (long presentation) [conference paper]
Yan Shuo Tan, Abhineet Agarwal, Bin Yu, A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds. AISTATS 2022. [arxiv]
Yan Shuo Tan, Roman Vershynin, Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval. Accepted by JMLR. [arxiv]
Chandan Singh, Keyan Nasseri, Yan Shuo Tan, Tiffany Tang, Bin Yu, imodels: a python package for fitting interpretable models. Journal of Open Source Software (2021). [journal paper]
John Lipor, David Hong, Yan Shuo Tan, Laura Balzano, Subspace Clustering using Ensembles of K-Subspaces. Information and Inference (2021). [journal paper]
Raaz Dwivedi*, Yan Shuo Tan*, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu, Stable discovery of interpretable subgroups via calibration in causal studies. International Statistical Review (2020). [journal paper]
Nick Altieri**, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu, Curating a COVID-19 data repository and forecasting county-level death counts in the United States. Harvard Data Science Review (2020). [journal paper]
Yan Shuo Tan, Roman Vershynin, Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods. COLT 2018. [conference paper]
Yan Shuo Tan, Roman Vershynin, Phase Retrieval via Randomized Kaczmarz: Theoretical Guarantees. Information and Inference (2018). [journal paper]
Yan Shuo Tan, Energy Optimization for Distributions on the Sphere and Improvement to the Welch Bounds. Electronic Communications in Probability 22 (2017). [journal paper]
* denotes equal contribution, ** denotes alphabetical order
Past Teaching
University of California, Berkeley (2018-)
Co-Instructor for Data 102 - Spring 2021
Instructor for Stat 210B (Graduate Theoretical Statistics II) - Spring 2020
University of Michigan (2013-2018)
Instructor for Math 115 (Differential Calculus) - Fall 2013, Fall 2014, Winter 2016
Instructor for Math 116 (Integral Calculus) - Winter 2014, Winter 2015
Instructor for Math 110 (Accelerated Precalculus) - Fall 2015
Course co-coordinator for Math 105 (Precalculus) - Fall 2017
Teaching assistant for Math 215 (Multivariate Calculus) - Summer 2014
Teaching assistant for Math 216 (Differential Equations) - Winter 2018
Miscellaneous
Teaching assistant for Roman Vershynin's mini-course on probabilistic methods for data science at PCMI 2016.