CAREER: Achieving Quality Information Extraction from Scientific Documents with Heterogeneous Weak Supervisions
NSF-2237831
Principal Investigator
Qi Li, Iowa State University
Students
Qing Wang. PhD student
Adithya Kulkarni. PhD student (graduated 2024)
Qiao Qiao. PhD student
Mohna Chakraborty. PhD student (graduated 2024)
Yuepei Li. PhD student
Kang Zhou. PhD student (graduated 2024)
Yonas Sium. PhD student
Wei Ying. PhD student
Xiang Ma. PhD student
Award Information
This website is based upon work supported by the National Science Foundation under Grant No. 2237831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Project Summary
The volume and breadth of the scientific literature are growing at an astonishing pace, making it challenging for researchers to keep up. Information extraction systems that can automatically extract structured information from this unstructured text are in high demand. Automated information extraction (IE) offers several benefits, among them easier search and organization of scientific documents, efficiency gains for curators, and lower curation costs. Although supervised deep learning-based IE methods achieve curation-level performance on some applications, they require large training datasets with accurate annotations. The goal of this project is to develop an adaptable and flexible information extraction framework that learns from existing resources rather than relying on costly and time-consuming expert annotations, and that bridges the performance gap in real applications by addressing the extraction quality concerns and unique requirements of IE tasks in the scientific literature. Success in this project will benefit many domains by providing mechanisms for processing massive unlabeled textual datasets, speeding up literature understanding and the curation process, and promoting new scientific discoveries. The investigator will engage in departmental Broadening Participation in Computing (BPC) activities and create educational materials based on results from this project for outreach programs to local K-12 schools and communities.
This project focuses on three complementary research thrusts, each of which addresses one key obstacle to information extraction from scientific documents: 1) advancing IE models to work with heterogeneous supervisions such as distant supervision and indirect supervision while taking advantage of all existing resources, 2) developing new semi-open information extraction tasks to extract detailed context and uncertainties at the document level, and 3) developing a novel learn-from-mistake paradigm that integrates first-order logic rules and new annotations from domain users to refine the IE models and results. The proposed research will address a variety of problems drawn from different information extraction settings, which will lead to new principles, methods, and technologies for machine learning, data mining, and natural language processing. The research thrusts will be applied to extract information from STEM textbooks to construct concept networks for educational purposes.
Publications
[COLING25A] Yuepei Li, Kang Zhou, Qiao Qiao, Qing Wang, and Qi Li. Re-Examine Distantly Supervised NER: A New Benchmark and a Simple Approach. The 31st International Conference on Computational Linguistics (COLING'25). [paper][code&data]
[COLING25B] Xiaqiang Tang, Qiang Gao, Jian Li, Nan Du, Qi Li, and Sihong Xie. MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity. The 31st International Conference on Computational Linguistics (COLING'25).
[NAACL24] Kang Zhou, Yuepei Li, Qing Wang, Qiao Qiao, and Qi Li. GenDecider: Integrating “None of the Candidates” Judgments in Zero-Shot Entity Linking Re-ranking. Proc. of 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'24), 2024.
[EMNLP23] Qing Wang, Haojie Jia, Wenfei Song, and Qi Li. CoRec: An Easy Approach for Coordination Recognition. Proc. of 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP'23), 2023. [paper]
[EMNLP23] Qing Wang, Kang Zhou, Qiao Qiao, Yuepei Li, and Qi Li. Improving Unsupervised Relation Extraction by Augmenting Diverse Sentence Pairs. Proc. of 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP'23), 2023. [paper]
[EMNLP23] Ying Wei and Qi Li. ScdNER: Span-Based Consistency-Aware Document-Level Named Entity Recognition. Proc. of 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP'23), 2023. [paper]
Courses
COM S 661: Advanced Database Systems
COMS 5790: Natural Language Processing