(This was me before I started my Ph.D. so don't trust it too much :))
Hello! I'm Yu Yang (杨雨). I work at OpenAI on the reasoning team. 🍓
I received my Ph.D. in Computer Science from the University of California, Los Angeles (UCLA) in 2024. My research explores how training data quality, distribution, and curriculum influence model performance and training efficiency. On these topics, I collaborated with researchers from Google DeepMind, Microsoft Research, Meta FAIR, and NVIDIA Research.
I was honored to receive the UCLA Dissertation Award and Outstanding Graduate Student Research Award.
I studied mathematics, statistics, and a little brain and mind science in college (also at UCLA).
I used to live in Beijing and Los Angeles. Now I'm based in San Francisco.
Work: yuyang AT openai DOT com
AIR-BENCH 2024: A Safety Benchmark Based on Regulation and Policies Specified Risk Categories
Yi Zeng*, Yu Yang*, Andy Zhou*, Jeffrey Ziwei Tan*, Yuheng Tu*, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li (*Equal Contribution)
In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. (Spotlight: 5.1%)
[OpenReview]
SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman
In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS), 2024.
[Preprint]
Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias
Yu Yang, Eric Gan, Gintare Karolina Dziugaite, Baharan Mirzasoleiman
In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
[Paper] [Code]
Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality
Xuxi Chen*, Yu Yang*, Zhangyang Wang, Baharan Mirzasoleiman (*Equal Contribution)
In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
[Preprint]
Robust Learning with Progressive Data Expansion Against Spurious Correlation
Yihe Deng*, Yu Yang*, Baharan Mirzasoleiman, Quanquan Gu (*Equal Contribution)
In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023.
[Paper] [Code] [Project Page]
Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data
Yu Yang, Aaditya K Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S Morcos, Newsha Ardalani
The 3rd Workshop on Efficient Natural Language and Speech Processing (ENLSP-III), NeurIPS 2023. (Oral)
[Paper]
Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
Yu Yang, Hao Kang, Baharan Mirzasoleiman
In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
[Paper] [Code]
Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
Yu Yang, Besmira Nushi, Hamid Palangi, Baharan Mirzasoleiman
In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
[Paper] [Code]
Not All Poisons are Created Equal: Robust Training against Data Poisoning
Yu Yang, Tian Yu Liu, Baharan Mirzasoleiman
In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022. (Oral: 2.10%)
[Paper] [Code]
Explaining Deep Convolutional Neural Networks via Unsupervised Visual-Semantic Filter Attention
Yu Yang, Seungbae Kim, Jungseock Joo
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. (Oral: 4.22%)
[Paper] [Code]