Yan Shuo Tan

I am an assistant professor in the Department of Statistics and Data Science at the National University of Singapore. Previously, I was a Neyman Visiting Assistant Professor in UC Berkeley's Statistics Department, where I was fortunate to be advised by Bin Yu. I received my PhD in Mathematics from the University of Michigan, where I was advised by Roman Vershynin and Anna Gilbert.

My current research is in statistical machine learning, focusing on the theory, methodology, and applications of tree-based models and randomized ensembles.

Email: yanshuo@nus.edu.sg

CV: last updated Aug 2024

Google Scholar

Highlighted publications

Xin Chen**, Jason M. Klusowski**, Yan Shuo Tan**, Chang Yu**. Revisiting Randomization in Greedy Model Search. [arxiv]
Qiong Zhang*, Yan Shuo Tan*, Qinglong Tian*, Pengfei Li. TabPFN: One Model to Rule Them All? Major revision at JASA. [arxiv]
Qiong Zhang*, Yan Shuo Tan*, and Jiahua Chen. Byzantine-tolerant distributed learning of finite mixture models. Submitted to JRSSB. [arxiv]
Yan Shuo Tan, Jason M. Klusowski, Krishnakumar Balasubramanian. Statistical-Computational Trade-offs for Recursive Adaptive Partitioning Estimators. Annals of Statistics (to appear 2025+). [arxiv]
Jean Feng, Avni Kothari, Luke Zier, Chandan Singh, Yan Shuo Tan. Bayesian Concept Bottleneck Models with LLM Priors. NeurIPS 2025. [arxiv]
Yan Shuo Tan, Omer Ronen, Theo Saarinen, and Bin Yu. The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis. R&R at Annals of Statistics. [arxiv]
Xin Chen, Jason M. Klusowski, and Yan Shuo Tan. Error Reduction from Stacked Regressions. [arxiv]
Abhineet Agarwal*, Yan Shuo Tan*, Omer Ronen, Chandan Singh, Bin Yu, Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods. ICML 2022 (long presentation) [conference paper]
Yan Shuo Tan*, Chandan Singh*, Keyan Nasseri*, Abhineet Agarwal*, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, Bin Yu, Fast Interpretable Greedy-Tree Sums (FIGS). PNAS 2025 [arxiv]
Yan Shuo Tan, Abhineet Agarwal, Bin Yu, A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds. AISTATS 2022. [arxiv]
Yan Shuo Tan, Roman Vershynin, Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval. JMLR (2023). [journal paper]

Publications

Yan Shuo Tan, Jason M. Klusowski, Krishnakumar Balasubramanian. Statistical-Computational Trade-offs for Recursive Adaptive Partitioning Estimators. Annals of Statistics (to appear 2025+). [arxiv]
Jean Feng, Avni Kothari, Luke Zier, Chandan Singh, Yan Shuo Tan. Bayesian Concept Bottleneck Models with LLM Priors. NeurIPS 2025. [arxiv]
Yan Shuo Tan*, Chandan Singh*, Keyan Nasseri*, Abhineet Agarwal*, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, Bin Yu, Fast Interpretable Greedy-Tree Sums (FIGS). PNAS 2025 [arxiv]
Abhineet Agarwal*, Yan Shuo Tan*, Omer Ronen, Chandan Singh, Bin Yu, Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods. ICML 2022 (long presentation) [conference paper]
Yan Shuo Tan, Abhineet Agarwal, Bin Yu, A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds. AISTATS 2022. [arxiv]
Yan Shuo Tan, Roman Vershynin, Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval. JMLR (2023). [journal paper]
Chandan Singh, Keyan Nasseri, Yan Shuo Tan, Tiffany Tang, and Bin Yu, imodels: a python package for fitting interpretable models. Journal of Open Source Software (2021). [journal paper]
John Lipor, David Hong, Yan Shuo Tan, Laura Balzano, Subspace Clustering using Ensembles of K-Subspaces. Information and Inference (2021). [journal paper]
Raaz Dwivedi*, Yan Shuo Tan*, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu, Stable discovery of interpretable subgroups via calibration in causal studies. International Statistical Review (2020). [journal paper]
Nick Altieri**, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu, Curating a COVID-19 data repository and forecasting county-level death counts in the United States. Harvard Data Science Review (2020). [journal paper]
Yan Shuo Tan, Roman Vershynin, Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods. COLT 2018. [conference paper]
Yan Shuo Tan, Roman Vershynin, Phase Retrieval via Randomized Kaczmarz: Theoretical Guarantees. Information and Inference (2018). [journal paper]
Yan Shuo Tan, Energy Optimization for Distributions on the Sphere and Improvement to the Welch Bounds. Electronic Communications in Probability 22 (2017). [journal paper]

* denotes equal contribution, ** denotes alphabetical order

Preprints

Xin Chen**, Jason M. Klusowski**, Yan Shuo Tan**, Chang Yu**. Revisiting Randomization in Greedy Model Search. [arxiv]
Qiong Zhang*, Yan Shuo Tan*, Qinglong Tian*, Pengfei Li. TabPFN: One Model to Rule Them All? Submitted to JASA. [arxiv]
Qiong Zhang*, Yan Shuo Tan*, and Jiahua Chen. Byzantine-tolerant distributed learning of finite mixture models. R&R at JRSSB. [arxiv]
Yan Shuo Tan, Omer Ronen, Theo Saarinen, and Bin Yu. The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis. Submitted to Annals of Statistics. [arxiv]
Xin Chen, Jason M. Klusowski, and Yan Shuo Tan. Error Reduction from Stacked Regressions. Submitted to Annals of Statistics. [arxiv]
Abhineet Agarwal*, Ana M. Kenney*, Yan Shuo Tan*, Tiffany M. Tang*, Bin Yu, MDI+: A Flexible Random Forest-Based Feature Importance Framework. Under revision. [arxiv]
Omer Ronen*, Theo Saarinen*, Yan Shuo Tan*, James Duncan, Bin Yu, A mixing time lower bound for a simplified version of BART. Submitted. [arxiv]
Yan Shuo Tan*, Chandan Singh*, Keyan Nasseri*, Abhineet Agarwal*, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, Bin Yu, Fast Interpretable Greedy-Tree Sums (FIGS). Submitted to PNAS [arxiv]
Yan Shuo Tan, Sparse Phase Retrieval via Sparse PCA Despite Model Misspecification: A Simplified and Extended Analysis. [arxiv]
