Accepted papers can also be read on OpenReview.
The archival proceedings are available online in the Communications in Computer and Information Science (CCIS) series.
Yanzhen Shen, Yu Zhang, Yunyi Zhang, Jiawei Han
Scientific taxonomy plays a crucial role in organizing and structuring scientific knowledge across various fields like Medical Science and Computer Science. With the rapid advancement of scientific research and the emergence of new scientific concepts, people have also sought to automatically populate an existing taxonomy. Entity set expansion, taxonomy expansion, and seed-guided taxonomy construction are three representative tasks that can be applied to automatic taxonomy construction. Previous studies view them as three separate tasks. Therefore, their proposed techniques usually work for one specific task only, lacking generalizability and a holistic perspective. In this paper, we aim at a unified solution to the three tasks. To be specific, we identify two common skills needed for entity set expansion, taxonomy expansion, and seed-guided taxonomy construction: finding siblings and finding parents. We propose a taxonomy-guided instruction tuning framework to teach a large language model to generate siblings and parents for query entities, where the joint pre-training process facilitates the mutual enhancement of the two skills. Extensive experiments on multiple benchmark datasets demonstrate the efficacy of our proposed TaxoInstruct framework, which outperforms task-specific baselines across all three tasks.
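As an illustration of the kind of taxonomy-guided instruction data described in this abstract, the sketch below derives "find parents" and "find siblings" training records from an existing taxonomy. The toy taxonomy, prompt wording, and record format are assumptions for illustration, not the authors' TaxoInstruct implementation.

```python
# Illustrative sketch: turning taxonomy edges into "find parents" and
# "find siblings" instruction-tuning records (hypothetical prompt wording).
from collections import defaultdict

# A toy taxonomy given as child -> parent edges.
edges = {
    "convolutional neural network": "neural network",
    "recurrent neural network": "neural network",
    "neural network": "machine learning",
    "decision tree": "machine learning",
}

children = defaultdict(list)
for child, parent in edges.items():
    children[parent].append(child)

def make_examples(entity):
    """Turn one taxonomy node into up to two instruction-tuning records."""
    parent = edges.get(entity)
    siblings = [c for c in children.get(parent, []) if c != entity]
    examples = []
    if parent:
        examples.append({
            "instruction": f"Find the parent concept of '{entity}'.",
            "output": parent,
        })
    if siblings:
        examples.append({
            "instruction": f"List sibling concepts of '{entity}'.",
            "output": ", ".join(sorted(siblings)),
        })
    return examples

if __name__ == "__main__":
    for ex in make_examples("convolutional neural network"):
        print(ex)
```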
Learning to Generate Research Idea with Dynamic Control
Ruochen Li, Liqiang Jing, Xinya Du
The rapid advancements in large language models (LLMs) have demonstrated their potential to accelerate scientific discovery, particularly in automating the process of research ideation. LLM-based systems have shown promise in generating hypotheses and research ideas. However, current approaches predominantly rely on prompting-based pre-trained models, limiting their ability to optimize generated content effectively. Moreover, they lack the capability to deal with the complex interdependence and inherent restrictions among novelty, feasibility, and effectiveness, which remains challenging due to the trade-offs among these dimensions, such as the innovation-feasibility conflict. To address these limitations, we propose a novel framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL). In the SFT stage, the model learns foundational patterns from pairs of research papers and follow-up ideas. In the RL stage, multi-dimensional reward modeling, guided by fine-grained feedback, evaluates and optimizes the generated ideas across key metrics. A dimensional controller enables dynamic adjustment of generation, while a sentence-level decoder ensures context-aware emphasis during inference. Our framework provides a balanced approach to research ideation, achieving high-quality outcomes by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.
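A minimal sketch of the controllable-reward idea outlined above: per-dimension scores for a generated idea are combined under adjustable control weights so that novelty, feasibility, or effectiveness can be emphasized at inference time. The interface and numbers are assumptions, not the paper's reward model.

```python
# Hypothetical combination of per-dimension rewards under control weights.
def combined_reward(scores, control):
    """scores and control are dicts over the same three dimensions."""
    dims = ("novelty", "feasibility", "effectiveness")
    total_weight = sum(control[d] for d in dims)
    return sum(control[d] * scores[d] for d in dims) / total_weight

# The same idea scored under two different control settings.
scores = {"novelty": 0.9, "feasibility": 0.4, "effectiveness": 0.6}
print(combined_reward(scores, {"novelty": 2.0, "feasibility": 1.0, "effectiveness": 1.0}))
print(combined_reward(scores, {"novelty": 1.0, "feasibility": 2.0, "effectiveness": 1.0}))
```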
AAAR-1.0: Assessing AI's Potential to Assist Research
Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin
Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.
Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
Scientific discovery contributes largely to the prosperity of human society, and recent progress shows that LLMs could potentially catalyze this process. However, it is still unclear whether LLMs can discover novel and valid hypotheses in chemistry. In this work, we investigate the central research question: can LLMs automatically discover novel and valid chemistry research hypotheses given only a research question? Through extensive discussions with chemistry experts, we adopt the assumption that a majority of chemistry hypotheses can be derived from a research background question and several inspirations. With this key insight, we break the main question into three smaller fundamental questions: (1) given a background question, whether LLMs can retrieve good inspirations; (2) with the background and inspirations, whether LLMs can formulate a good hypothesis; and (3) whether LLMs can identify good hypotheses and rank them higher. To investigate these questions, we construct a benchmark consisting of 51 chemistry papers published in Nature or venues of a similar level in 2024 (all papers have only been available online since 2024). Every paper is divided by chemistry PhD students into three components: background, inspirations, and hypothesis. The goal is to rediscover the hypothesis given only the background and a large chemistry literature corpus containing the ground-truth inspiration papers, using LLMs trained on data up to 2023. We also develop an LLM-based multi-agent framework that leverages this assumption, consisting of three stages that mirror the three smaller questions. The proposed method rediscovers many hypotheses with very high similarity to the ground-truth ones, covering the main innovations.
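A schematic sketch of the three-stage structure described in this abstract, using two placeholder callables (gen_llm returns text, score_llm returns a number). The prompts and interfaces are illustrative assumptions, not the paper's agents.

```python
# Hypothetical three-stage pipeline: retrieve inspirations, compose
# hypotheses, rank hypotheses. All prompts are placeholders.
def retrieve_inspirations(score_llm, background, corpus, k=3):
    """Stage 1: rank the literature corpus as candidate inspirations."""
    key = lambda paper: score_llm(
        f"How useful is this paper as an inspiration for: {background}\n\n{paper}")
    return sorted(corpus, key=key, reverse=True)[:k]

def compose_hypotheses(gen_llm, background, inspirations, n=5):
    """Stage 2: combine background and inspirations into candidate hypotheses."""
    prompt = (f"Background question: {background}\n"
              f"Inspirations: {inspirations}\n"
              "Propose one chemistry research hypothesis.")
    return [gen_llm(prompt) for _ in range(n)]

def rank_hypotheses(score_llm, background, hypotheses):
    """Stage 3: rank candidate hypotheses so better ones come first."""
    key = lambda h: score_llm(
        f"Score the novelty and validity of this hypothesis for '{background}': {h}")
    return sorted(hypotheses, key=key, reverse=True)
```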
Profiling and Analyzing Climate Change Statements in IPCC Reports
Ruiqi Li, Paige Reeves, Alasdair Tran, Jing Jiang, Lexing Xie
We propose new methods to extract and profile the climate change statements from the Sixth Assessment Reports of the Intergovernmental Panel on Climate Change (IPCC). We represent the 10,393 statements from the latest IPCC reports (AR6) with associated uncertainty levels and glossary terms. We profile their distributions across different parts of the 6000+ page AR6 reports. We also present a few case studies centered around the glossary term "wetland", namely linking related statements across summary sections and chapter content, finding and profiling supporting references, and comparing them with large language models for statement summarization. We believe this work marks an initial step towards in-depth information extraction regarding climate change. It lays the groundwork for more advanced automated analysis of climate-related statements and broader integrative scientific assessments.
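To make the notion of "statements with associated uncertainty levels" concrete, here is a small sketch (not the authors' pipeline) that tags a sentence with IPCC-style calibrated uncertainty language; the term list is a subset chosen for illustration.

```python
# Illustrative tagging of IPCC calibrated-uncertainty language via regex.
import re

UNCERTAINTY_TERMS = [
    "virtually certain", "extremely likely", "very likely", "likely",
    "about as likely as not", "unlikely", "very unlikely",
    "high confidence", "medium confidence", "low confidence",
]
PATTERN = re.compile("|".join(re.escape(t) for t in UNCERTAINTY_TERMS), re.I)

def tag_statement(sentence):
    """Return the calibrated-uncertainty terms found in one statement."""
    return [m.group(0).lower() for m in PATTERN.finditer(sentence)]

print(tag_statement(
    "Global mean sea level rise is very likely to continue (high confidence)."))
```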
Towards LLM-Driven Multi-Agent Pipeline for Drug Discovery: Neurodegenerative Diseases Case Study
Gleb Vitalevich Solovev, Alina Borisovna Zhidkovskaya, Anastasia Orlova, Anastasia Vepreva, Tonkii Ilya, Rodion Golovinskii, Nina Gubina, Denis Chistiakov, Timur A. Aliev, Ivan Poddiakov, Galina Zubkova, Ekaterina V. Skorb, Vladimir Vinogradov, Nikolay Nikitin, Andrei Dmitrenko, Anna Kalyuzhnaya, Andrey Savchenko
Recent studies demonstrate that Large Language Models (LLMs) can accelerate scientific progress in chemistry and drug development. However, existing approaches have not achieved successful automation of the complete drug discovery pipeline, primarily due to the absence of comprehensive datasets and the limitations of single-model solutions. This paper introduces a multi-agent approach that combines LLMs with specialized generative models and validation tools to automate the end-to-end drug discovery process. The key innovation lies in addressing the complex transition from natural language problem formulation to building a complete computational pipeline for real pharmaceutical research tasks. Experimental results demonstrate that our multi-agent solution achieves 92% accuracy on complex end-to-end drug search tasks, significantly outperforming single-agent implementations. We validated the system's effectiveness on an original, newly constructed dataset with tasks and full solutions for three pharmaceutical cases targeting neurodegenerative diseases (Alzheimer's, multiple sclerosis, and Parkinson's). The main contributions include demonstrating the advantages of a multi-agent LLM-powered approach for automating pharmaceutical drug design and validating its success on real-world drug discovery challenges.
Jongwon Ryu, Junyeong Kim
Despite recent advances in large language models (LLMs), logical reasoning remains a challenging area, particularly for complex, multi-step reasoning in open-domain contexts. To address this, we introduce the Custom Graph Dataset, a novel graph-based knowledge resource designed to enhance LLMs’ reasoning capabilities. Using a Self-Prompting mechanism, our approach automatically generates both pre-defined and dynamic relations, creating a dual-triple structure (Head-Relation-Tail and Tail-Dynamic Relation-Additional Tail) that supports richer multi-step reasoning. This Self-Prompting-driven process captures a broad and adaptable range of logical connections, combining predefined relational knowledge with dynamically generated, context-specific relations. Experimental results demonstrate that models fine-tuned on this dataset significantly outperform both baseline and control models, particularly on reasoning-intensive benchmarks like Commonsense QA, Riddle Sense, and ARC Challenge. Notably, the dataset includes 133 unique dynamic relations, such as Analogous, Contextual, and Complementary, which contribute to nuanced, context-sensitive reasoning. While general-purpose data offers benefits for some tasks, our findings validate that a targeted, logic-specific dataset can substantially improve LLMs’ reasoning skills. This work underscores the potential of flexible, Self-Prompting-generated knowledge structures to advance LLM reasoning capabilities, suggesting future directions in combining structured and unstructured data to optimize inference.
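For readers unfamiliar with the dual-triple structure mentioned above, the following sketch shows one possible data model; the field names and example entities are assumptions, not the dataset's actual schema.

```python
# Hypothetical data model for the dual-triple structure:
# a predefined Head-Relation-Tail triple extended by a
# Tail-DynamicRelation-AdditionalTail triple.
from dataclasses import dataclass

@dataclass
class Triple:
    head: str
    relation: str
    tail: str

@dataclass
class DualTriple:
    primary: Triple     # Head - Relation - Tail (predefined relation)
    extension: Triple   # Tail - Dynamic Relation - Additional Tail

example = DualTriple(
    primary=Triple("violin", "is_a", "string instrument"),
    extension=Triple("string instrument", "Contextual", "orchestra section"),
)
print(example)
```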
DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration
Sizhe Liu, Yizhou Lu, Siyu Chen, Xiyang Hu, Jieyu Zhao, Tianfan Fu, Yue Zhao
Recent advancements in Large Language Models (LLMs) have opened new avenues for accelerating drug discovery processes. Despite their potential, several critical challenges remain unsolved, particularly in translating theoretical ideas into practical applications within the highly specialized field of pharmaceutical research, which limits practitioners' ability to leverage the latest AI developments in drug discovery. To this end, we introduce DrugAgent, a multi-agent framework aimed at automating machine learning (ML) programming in drug discovery. DrugAgent incorporates domain expertise by identifying specific requirements and building domain-specific tools, while systematically exploring different ideas to find effective solutions. A preliminary case study demonstrates DrugAgent’s potential to overcome key limitations LLMs face in drug discovery, moving toward AI-driven innovation. For example, DrugAgent is able to complete the ML programming pipeline end-to-end, from data acquisition to performance evaluation, for the ADMET prediction task and to select the best model, where a random forest model achieves an F1 score of 0.92 when predicting absorption on the PAMPA dataset.
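A hedged sketch of the final modeling step named in this abstract: a random forest classifier evaluated with F1 for a binary absorption/permeability label. The features below are synthetic stand-ins for molecular descriptors; the real PAMPA data and featurization are not reproduced here.

```python
# Stand-in random-forest + F1 evaluation (synthetic data, not PAMPA).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))                 # stand-in molecular descriptors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in permeability label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, model.predict(X_te)))
```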
Reflection System for the Abstraction and Reasoning Corpus
Kiril Bikov, Mikel Bober-Irizar, Soumya Banerjee
The Abstraction and Reasoning Corpus (ARC) benchmarks broad generalization in artificial intelligence, and presents a significant challenge to existing machine learning models and program synthesis solvers. In this work, we introduce a Reflection System for ARC. It combines Large Language Models (LLMs) and a program synthesis solver based on a Domain Specific Language (DSL). We analyse the accuracy of LLMs on ARC and demonstrate unsatisfactory results. We create AugARC, an augmented ARC benchmark, which consistently improves the performance of LLMs compared to the normal ARC benchmark. Using augmented ARC data, we fine-tune LLMs and observe a significant gain in ARC accuracy after training. By utilizing reflection, we combine LLMs and a previous DSL solver into our Reflection System for abstraction and reasoning. The proposed Reflection System motivates research to advance previous ARC attempts by combining the advantages of LLMs and program synthesis solvers with reflection.
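The abstract combines a DSL program-synthesis solver with LLM reflection; the sketch below shows one way such a loop could be wired together. The interfaces (dsl_solver, llm_propose, llm_reflect) and the task format are assumptions, not the paper's system.

```python
# Hypothetical reflection loop over ARC-style tasks.
def reflection_solve(task, dsl_solver, llm_propose, llm_reflect, max_rounds=3):
    """task: {"train": [{"input": ..., "output": ...}], "test_input": ...}."""
    # 1) Try the DSL program-synthesis solver on the training pairs.
    program = dsl_solver(task["train"])
    if program is not None:
        return program(task["test_input"])

    # 2) Fall back to the LLM, reflecting on mistakes made on training pairs.
    feedback = ""
    answer = None
    for _ in range(max_rounds):
        proposal = llm_propose(task, feedback)  # {"train_outputs": [...], "test_output": ...}
        answer = proposal["test_output"]
        if proposal["train_outputs"] == [p["output"] for p in task["train"]]:
            break
        feedback = llm_reflect(task, proposal)
    return answer
```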
What Would You Ask When You First Saw a²+b²=c²? Evaluating LLM on Curiosity-Driven Questioning
Shashidhar Reddy Javaji, Zining Zhu
Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that assesses this capability. The framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person encountering the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct difficulty levels, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen’s kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. The proposed framework quantifies a critical model capability that has commonly been overlooked and opens up research opportunities for developing more knowledgeable AI systems.
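The human-agreement check mentioned above (weighted Cohen's kappa between human and model ratings) can be computed as in the sketch below; the ratings are made up for illustration and do not come from the paper.

```python
# Illustrative agreement check between human and model question-quality ratings.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 1, 4, 3, 2, 5, 4]   # e.g. 1-5 quality ratings
model_scores = [3, 4, 3, 5, 1, 3, 3, 2, 4, 4]

kappa = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"weighted Cohen's kappa: {kappa:.2f}")
```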
Bridged Clustering in Scientific Research
Peixuan Ye, Yingtong Wu, Ellen Vitercik
We introduce Bridged Clustering, an algorithm that leverages existing unsupervised datasets to help achieve new supervised objectives in scientific research. Applying supervised learning to scientific research often poses the challenge of labeling enough samples to support scalable inference. As an alternative to excessive labeling, our algorithm uses unlabeled data that is either already available in existing research or easier to collect in general. Bridged Clustering combines two distinct sets of unlabeled data with a sparse supervised dataset to perform inference. The algorithm operates by independently clustering the input and output feature spaces, then learning a mapping between these clusters using the supervised set. This approach effectively bridges the gap between disparate data sources, enhancing predictive performance without needing extensive labeled data. We demonstrate the efficacy of Bridged Clustering in a biological context, where it successfully infers genetic information of leaf samples from their morphological traits. In general, Bridged Clustering offers a robust framework for utilizing available unlabeled data to support new inference objectives in scientific research, especially where labeled data is scarce.
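A condensed sketch of the procedure as described in the abstract: cluster the input and output spaces separately, learn a cluster-to-cluster map from a few labeled pairs, and predict the centroid of the mapped output cluster. Hyperparameters, the majority-vote bridging rule, and function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative Bridged Clustering sketch with k-means in both spaces.
import numpy as np
from sklearn.cluster import KMeans

def fit_bridge(X_unlabeled, Y_unlabeled, X_paired, Y_paired, k=5, seed=0):
    kx = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X_unlabeled)
    ky = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(Y_unlabeled)
    # Majority vote over the sparse supervised pairs: input cluster -> output cluster.
    bridge = {}
    cx, cy = kx.predict(X_paired), ky.predict(Y_paired)
    for i in range(k):
        votes = cy[cx == i]
        bridge[i] = int(np.bincount(votes, minlength=k).argmax()) if len(votes) else i
    return kx, ky, bridge

def predict(x, kx, ky, bridge):
    """Map a new input to the centroid of its bridged output cluster."""
    j = bridge[int(kx.predict(x.reshape(1, -1))[0])]
    return ky.cluster_centers_[j]
```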
Towards Scalable Oversight: Meta-Evaluation of LLMs as Evaluators via Agent Debate
Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu
Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, existing meta-evaluation methods for assessing the effectiveness of LLMs as evaluators are typically constrained by the coverage of existing benchmarks or require extensive human annotation. This underscores the urgency of methods for scalable meta-evaluation that can effectively, reliably, and efficiently evaluate the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist humans in discerning the capabilities and limitations of LLMs as evaluators, significantly reducing their workload in cases that previously required extensive supervision and large-scale annotation during meta-evaluation.
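To illustrate the agent-debate idea described above, the sketch below has several LLM agents judge which of two candidate responses is better, exchange their reasoning for a few rounds, and escalate to a human only when no consensus is reached. The agent interface and prompt wording are assumptions, not ScaleEval's actual protocol.

```python
# Schematic multi-agent debate for meta-evaluation (assumed interfaces).
def agent_debate(agents, task_prompt, response_a, response_b, rounds=3):
    """Each agent is a callable: agent(prompt, transcript) -> (verdict, reason)."""
    transcript = []
    for _ in range(rounds):
        verdicts = []
        for agent in agents:
            verdict, reason = agent(
                f"Task: {task_prompt}\nA: {response_a}\nB: {response_b}\n"
                f"Which response is better, A or B?", transcript)
            verdicts.append(verdict)
            transcript.append(reason)
        if len(set(verdicts)) == 1:       # consensus reached
            return verdicts[0]
    return "needs_human_review"           # no consensus: escalate to a human
```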