Yu Yang

About Me

Hello! I'm Yu Yang (杨雨), a final-year Ph.D. student in Computer Science at the University of California, Los Angeles (UCLA), where I am fortunate to be advised by Baharan Mirzasoleiman. My research focuses on understanding and improving large-scale training data for efficient and robust learning.

I'm also a founding research scientist at Virtue AI, where I lead evaluation and red-teaming of code generation models and agents.

I used to live in Beijing and Los Angeles, and I'm currently based in San Francisco.

Email: yuyang AT cs.ucla.edu

Awards

⭐ UCLA Dissertation Year Award, 2024

⭐ Amazon Doctoral Student Fellowship, 2022

⭐ UCLA Computer Science Fellowship, 2021

News

09/2024: Our paper, SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models, has been accepted to NeurIPS 2024!
08/2024: Our paper AIR-Bench 2024 was covered by WIRED!
06/2024: I was selected to receive the Dissertation Year Award! ⭐
02/2024: Invited presentation at UCLA Research in the Age of AI Symposium.
01/2024: Our paper, Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias, has been accepted to AISTATS 2024!
01/2024: Our paper, Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality, has been accepted to ICLR 2024!

Experience

2024

Founding Research Scientist, Virtue AI

2023

Research Scientist Intern, AI Systems Machine Learning @ FAIR at Meta

- With Newsha Ardalani, Ari Morcos, Carole-Jean Wu
- Project: Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data
  [Paper]

2022

Research Intern, Robustness of Platform Models in Language and Vision @ Microsoft Research

- With Besmira Nushi, Hamid Palangi
- Project: Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
  [Paper @ ICML] [Code]

2021

Applied Scientist Intern, Computer Vision @ Amazon Alexa AI

- With Yue (Rex) Wu, Varsha Hedau
- Project: Enhancing Fairness in Face Detection in Computer Vision Systems by Demographic Bias Mitigation
  [Paper @ AIES] [Dataset Release]

Selected Publications [Full List]

2024

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman
Accepted to Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS), 2024.
[Preprint]

Few-shot Adaption to Distribution Shifts By Mixing Source and Target Embeddings
Yihao Xue, Ali Payani, Yu Yang, Baharan Mirzasoleiman
In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[Paper]

Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias
Yu Yang, Eric Gan, Gintare Karolina Dziugaite, Baharan Mirzasoleiman
In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
[Paper] [Code]

Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality
Yu Yang*, Xuxi Chen*, Zhangyang Wang, Baharan Mirzasoleiman (*Equal Contribution)
In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024.
[Preprint]

SIEVE: Multimodal Dataset Pruning Using Image Captioning Models
Anas Mahmoud, Mostafa Elhoushi, Amro Abbas, Yu Yang, Newsha Ardalani, Hugh Leather, Ari S Morcos
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[Paper]

2023

Robust Learning with Progressive Data Expansion Against Spurious Correlation
Yu Yang*, Yihe Deng*, Baharan Mirzasoleiman, Quanquan Gu (*Equal Contribution)
In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023.
[Paper] [Code] [Project Page]

Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data
Yu Yang, Aaditya K Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S Morcos, Newsha Ardalani
The 3rd Workshop on Efficient Natural Language and Speech Processing (ENLSP-III), NeurIPS 2023. (Oral)
[Paper]

CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
Hritik Bansal*, Nishad Singhi*, Yu Yang, Fan Yin, Aditya Grover, Kai-Wei Chang (*Equal Contribution)
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. (Oral: 1.8%)
[Paper] [Code]

Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
Yu Yang, Hao Kang, Baharan Mirzasoleiman
In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
[Paper] [Code]

Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
Yu Yang, Besmira Nushi, Hamid Palangi, Baharan Mirzasoleiman
In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
[Paper] [Code]

2022

Not All Poisons are Created Equal: Robust Training against Data Poisoning
Yu Yang, Tian Yu Liu, Baharan Mirzasoleiman
In Proceedings of the International Conference on Machine Learning (ICML), 2022. (Oral: 2.10%)
[Paper] [Code]

Enhancing Fairness in Face Detection in Computer Vision Systems by Demographic Bias Mitigation
Yu Yang, Aayush Gupta, Jianwei Feng, Yue Rex Wu, Vivek Yadav, Varsha Hedau, Prateek Singhal, Pradeep Natarajan, Jungseock Joo
In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2022.
[Paper] [Dataset]

Explaining Deep Convolutional Neural Networks via Unsupervised Visual-Semantic Filter Attention
Yu Yang, Seungbae Kim, Jungseock Joo
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. (Oral: 4.22%)
[Paper] [Code]

Teaching

Teaching Assistant of COM SCI M148 - Introduction to Data Science, Winter 2024
Teaching Assistant of COM SCI M148 - Introduction to Data Science, Winter 2023
Reader of COM SCI M146 - Introduction to Machine Learning and MATH 32A - Calculus of Several Variables

Academic Activities

Reviewed for ICML 2022-2024, NeurIPS 2022-2023, ICLR 2023-2024, CVPR 2023-2024, ICCV 2023, AAAI 2024, Sparsity in Neural Networks Workshop 2021-2023, New Frontiers in Adversarial Machine Learning 2023
