CAREER: Achieving Quality Information Extraction from Scientific Documents with Heterogeneous Weak Supervisions
NSF-2237831
Principal Investigator
Qi Li, Iowa State University
Students
Qing Wang. PhD student
Adithya Kulkarni. PhD student
Qiao Qiao. PhD student
Mohna Chakraborty. PhD student
Yuepei Li. PhD student
Kang Zhou. PhD student
Yonas Sium. PhD student
Wei Ying. PhD student
Xiang Ma. PhD student
Award Information
This website is based upon work supported by the National Science Foundation under Grant No. 2237831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Project Summary
The volume and breadth of the scientific literature are growing at an astonishing pace, making it challenging for researchers to keep up. Information extraction systems that can automatically extract structured information from this unstructured text are in high demand. Benefits from automated information extraction (IE) are multifold: it is easier to search and organize scientific documents, it results in efficiency gains for curators, and it reduces curation costs, among others. Although supervised deep learning-based IE methods achieve curation-level performance on some applications, large training datasets with accurate annotations are necessary to achieve these results. The goal of this project is to develop an adaptable and flexible information extraction framework that learns from existing resources, does not rely on costly and time-consuming expert annotations, and bridges the performance gap in real applications by addressing the extraction quality concerns and unique requirements of IE tasks in the scientific literature. Success in this project will benefit many domains by providing mechanisms for processing massive unlabeled textual datasets, speeding up literature understanding and the curation process, and promoting new scientific discoveries. The investigator will engage in departmental Broadening Participation in Computing (BPC) activities and create educational materials based on results from this project for outreach programs to local K-12 schools and communities.
This project is focused on three complementary research thrusts, each of which addresses one key obstacle of information extraction on scientific documents: 1) advancing IE models to work with heterogeneous supervisions such as distant supervision and indirect supervision while taking advantage of all existing resources, 2) developing new semi-open information extraction tasks to extract detailed context and uncertainties at the document level, and 3) developing a novel learn-from-mistake paradigm that integrates first-order logic rules and new annotations from domain users to refine the IE models and results. The proposed research will address a variety of problems drawn from different information extraction settings, which will lead to new principles, methods, and technologies for machine learning, data mining, and natural language processing. The research thrusts will be applied to extract information from STEM textbooks to construct concept networks for education purposes.
Publications
Courses
COM S 661: Advanced Database Systems