Program
Day and Time: March 3rd, 2025, 9:00 am - 5:00 pm
Location: Room 119A, Pennsylvania Convention Center, Philadelphia, PA, USA
Event times shown in the schedule are local times in Philadelphia, PA, USA.
9:00-9:10 Opening Remarks (Laure Berti-Equille and Shiqiang Wang)
9:10-9:35 Invited Talk 1: GneissWeb: Open Innovation for Advancing Training Data Quality (David Cox, IBM)
9:35-10:00 Invited Talk 2: Better, Safer, and More Data for Foundation Model Training (Daniel Li, Meta)
10:00-10:30 Contributed Oral Presentation Session 1 (3 papers)
Yiqiao Jin, Yijia Xiao, Yiyang Wang, Jindong Wang. Scito2M: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis (Best Paper Award).
Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth R. Sastry. MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models.
YoonJe Kang, Jung Yonghoon, Wonseop Shin, Bumsoo Kim, Sanghyun Seo. MultiFloodSynth: Multi-Annotated Flood Synthetic Dataset Generation.
10:30-10:45 Poster Session 1 (5 posters)
Ruining Yang, Lili Su. Data Selection through Scenario-Balanced Coresets for Trajectory Prediction.
Robert J. Moore, Sungeun An, Jay Pankaj Gala, Divyesh Jadav. FineWeb-Conv: A Method for Finding Good Conversation Data.
Bhagyashree Puranik, Bugra Can, Yi Fan. Tabular Out-of-distribution Data Synthesis for Enhancing Robustness.
Masatoshi Sekine, Daisuke Shimbara, Tomoyuki Myojin, Eri Imatani. Enhancing Dataset Sufficiency for Attributes Through Text-Driven Generative Data Augmentation.
Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong. Multilingual Challenges in Automated Evaluators: A Case Study on English and Korean.
10:45-11:00 Break
11:00-11:25 Invited Talk 3: Building Secure RAG Applications with Open LLM Models (Timothy Spann, Snowflake)
11:25-11:50 Invited Talk 4: Building Code Models with Reasoning (Baishakhi Ray, Columbia University)
11:50-12:10 Contributed Oral Presentation Session 2 (2 papers)
Masaya Hasegawa, Koji Yasuda. Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models.
Fan Xie, Dan Zeng, Qiaomu Shen, Bo Tang. AbsText2Video: Embracing Abstract Annotations to Caption Video Dataset.
12:10-12:40 Poster Session 2 (10 posters)
Yiqiao Jin, Yijia Xiao, Yiyang Wang, Jindong Wang. Scito2M: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis.
Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth R. Sastry. MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models.
YoonJe Kang, Jung Yonghoon, Wonseop Shin, Bumsoo Kim, Sanghyun Seo. MultiFloodSynth: Multi-Annotated Flood Synthetic Dataset Generation.
Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter. ANYMATCH – Efficient Zero-Shot Entity Matching with a Small Language Model.
Kiril Bikov, Mikel Bober-Irizar, Soumya Banerjee. AugARC: Augmented Abstraction and Reasoning Benchmark for Large Language Models.
Masaya Hasegawa, Koji Yasuda. Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models.
Enzo Ruedas, Baptiste Pouthier. HiRAG: Human-inspired Retrieval-Augmented Generation.
Mirali Purohit, Gedeon Muhawenayo, Esther Rolf, Hannah Kerner. How Does the Spatial Distribution of Pre-training Data Affect Geospatial Foundation Models?
Nirav Diwan, Tolga Ergen, Dongsub Shim, Honglak Lee. Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning.
Fan Xie, Dan Zeng, Qiaomu Shen, Bo Tang. AbsText2Video: Embracing Abstract Annotations to Caption Video Dataset.
12:40-14:00 Lunch
14:00-14:25 Invited Talk 5: Can Generative AI be Egalitarian? (Shimei Pan & James Foulds, University of Maryland, Baltimore County)
14:25-14:50 Invited Talk 6: Annotating Common Crawl for Good (Greg Lindahl, Common Crawl Foundation) - Canceled
14:50-15:20 Contributed Oral Presentation Session 3 (3 papers)
Kshitij Singh Minhas, Qiao Wang, Niluthpol Chowdhury Mithun, Ben Southall, Supun Samarasekera, Rakesh Kumar. MWAG: Multi-Season Wide-Area Air Ground Dataset for 3D Scene Reconstruction and Novel View Synthesis.
Svetlana Churina, Kokil Jaidka. Fine-Tuning LLMs with noisy data for political argument generation.
Youyang Kim, Yaoping Ruan, Byungchul Tak. Curating Online Forum Knowledge as Troubleshooting Dataset for Generative AI Using Fusion Retrieval (Best Paper Award).
15:20-15:50 Poster Session 3 (10 posters)
Sangyeon Cho, Mingi Kim, Hwang JinKwon, Jaehoon Go, Junyeong Kim. Improving Multimodal Data Quality with Unified Filtering Score (UF-Score).
Shivam Ratnakar, Abhiroop Talasila, Raghav Chamadiya, Nikhil Agarwal, Vinayak K Doifode. Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs.
Hedi Naouara, Jérôme Louradour, Jean-Pierre Lorré. LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect.
Yixuan Liang, Yuncong Liu, Boyu Zhang, Christina Dan Wang, Hongyang Yang. FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLMs.
Barbara Hoffmann, Ruben Mayer. Comparing Methods for Bias Mitigation in Graph Neural Networks.
Madeline Loui Anderson, Miriam Cha, William T. Freeman, J. Taylor Perron, Nathaniel Maidel, Kerri Cahoy. Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing.
Kshitij Singh Minhas, Qiao Wang, Niluthpol Chowdhury Mithun, Ben Southall, Supun Samarasekera, Rakesh Kumar. MWAG: Multi-Season Wide-Area Air Ground Dataset for 3D Scene Reconstruction and Novel View Synthesis.
Tingyuan Zhu, Shudong Liu, Yidong Wang, Derek F. Wong, Han Yu, Takahiro Shinozaki, Jindong Wang. Learning from "Silly" Questions Improves Large Language Models, But Only Slightly.
Svetlana Churina, Kokil Jaidka. Fine-Tuning LLMs with noisy data for political argument generation.
Youyang Kim, Yaoping Ruan, Byungchul Tak. Curating Online Forum Knowledge as Troubleshooting Dataset for Generative AI Using Fusion Retrieval.
15:50-16:00 Break
16:00-16:55 Panel discussion: What's Next in Data for LLMs?
Panelists:
John McBroom (IBM)
Shimei Pan (University of Maryland, Baltimore County)
Yaoping Ruan (PARC AI LLC)
Moderator: Shiqiang Wang
16:55-17:00 Best paper announcement & closing
List of Accepted Papers and Oral Presentations
The poster presentations include all the accepted papers. In addition, selected top-rated papers will also be presented as oral presentations.
Selected Oral Presentations
Yiqiao Jin, Yijia Xiao, Yiyang Wang, Jindong Wang. Scito2M: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis (Best Paper Award).
Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth R. Sastry. MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models.
YoonJe Kang, Jung Yonghoon, Wonseop Shin, Bumsoo Kim, Sanghyun Seo. MultiFloodSynth: Multi-Annotated Flood Synthetic Dataset Generation.
Masaya Hasegawa, Koji Yasuda. Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models.
Fan Xie, Dan Zeng, Qiaomu Shen, Bo Tang. AbsText2Video: Embracing Abstract Annotations to Caption Video Dataset.
Kshitij Singh Minhas, Qiao Wang, Niluthpol Chowdhury Mithun, Ben Southall, Supun Samarasekera, Rakesh Kumar. MWAG: Multi-Season Wide-Area Air Ground Dataset for 3D Scene Reconstruction and Novel View Synthesis.
Svetlana Churina, Kokil Jaidka. Fine-Tuning LLMs with noisy data for political argument generation.
Youyang Kim, Yaoping Ruan, Byungchul Tak. Curating Online Forum Knowledge as Troubleshooting Dataset for Generative AI Using Fusion Retrieval (Best Paper Award).
All Accepted Papers
Ruining Yang, Lili Su. Data Selection through Scenario-Balanced Coresets for Trajectory Prediction.
Yiqiao Jin, Yijia Xiao, Yiyang Wang, Jindong Wang. Scito2M: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis.
Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth R. Sastry. MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models.
Robert J. Moore, Sungeun An, Jay Pankaj Gala, Divyesh Jadav. FineWeb-Conv: A Method for Finding Good Conversation Data.
Bhagyashree Puranik, Bugra Can, Yi Fan. Tabular Out-of-distribution Data Synthesis for Enhancing Robustness.
Masatoshi Sekine, Daisuke Shimbara, Tomoyuki Myojin, Eri Imatani. Enhancing Dataset Sufficiency for Attributes Through Text-Driven Generative Data Augmentation.
Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong. Multilingual Challenges in Automated Evaluators: A Case Study on English and Korean.
YoonJe Kang, Jung Yonghoon, Wonseop Shin, Bumsoo Kim, Sanghyun Seo. MultiFloodSynth: Multi-Annotated Flood Synthetic Dataset Generation.
Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter. ANYMATCH – Efficient Zero-Shot Entity Matching with a Small Language Model.
Kiril Bikov, Mikel Bober-Irizar, Soumya Banerjee. AugARC: Augmented Abstraction and Reasoning Benchmark for Large Language Models.
Masaya Hasegawa, Koji Yasuda. Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models.
Enzo Ruedas, Baptiste Pouthier. HiRAG: Human-inspired Retrieval-Augmented Generation.
Mirali Purohit, Gedeon Muhawenayo, Esther Rolf, Hannah Kerner. How Does the Spatial Distribution of Pre-training Data Affect Geospatial Foundation Models?
Nirav Diwan, Tolga Ergen, Dongsub Shim, Honglak Lee. Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning.
Fan Xie, Dan Zeng, Qiaomu Shen, Bo Tang. AbsText2Video: Embracing Abstract Annotations to Caption Video Dataset.
Sangyeon Cho, Mingi Kim, Hwang JinKwon, Jaehoon Go, Junyeong Kim. Improving Multimodal Data Quality with Unified Filtering Score (UF-Score).
Shivam Ratnakar, Abhiroop Talasila, Raghav Chamadiya, Nikhil Agarwal, Vinayak K Doifode. Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs.
Hedi Naouara, Jérôme Louradour, Jean-Pierre Lorré. LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect.
Yixuan Liang, Yuncong Liu, Boyu Zhang, Christina Dan Wang, Hongyang Yang. FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLMs.
Barbara Hoffmann, Ruben Mayer. Comparing Methods for Bias Mitigation in Graph Neural Networks.
Madeline Loui Anderson, Miriam Cha, William T. Freeman, J. Taylor Perron, Nathaniel Maidel, Kerri Cahoy. Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing.
Kshitij Singh Minhas, Qiao Wang, Niluthpol Chowdhury Mithun, Ben Southall, Supun Samarasekera, Rakesh Kumar. MWAG: Multi-Season Wide-Area Air Ground Dataset for 3D Scene Reconstruction and Novel View Synthesis.
Tingyuan Zhu, Shudong Liu, Yidong Wang, Derek F. Wong, Han Yu, Takahiro Shinozaki, Jindong Wang. Learning from "Silly" Questions Improves Large Language Models, But Only Slightly.
Svetlana Churina, Kokil Jaidka. Fine-Tuning LLMs with noisy data for political argument generation.
Youyang Kim, Yaoping Ruan, Byungchul Tak. Curating Online Forum Knowledge as Troubleshooting Dataset for Generative AI Using Fusion Retrieval.