CAREER: Achieving Quality Information Extraction from Scientific Documents with Heterogeneous Weak Supervisions

NSF-2237831

Principal Investigator

Qi Li, Iowa State University

Students

Qing Wang, PhD student

Adithya Kulkarni, PhD student

Qiao Qiao, PhD student

Mohna Chakraborty, PhD student

Yuepei Li, PhD student

Kang Zhou, PhD student

Yonas Sium, PhD student

Wei Ying, PhD student

Xiang Ma, PhD student


Award Information

This website is based upon work supported by the National Science Foundation under Grant No. 2237831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Project Summary

The volume and breadth of the scientific literature are growing at an astonishing pace, making it challenging for researchers to keep up. Information extraction (IE) systems that can automatically extract structured information from this unstructured text are in high demand. The benefits of automated IE are manifold: scientific documents become easier to search and organize, curators work more efficiently, and curation costs fall, among other gains. Although supervised deep learning-based IE methods achieve curation-level performance on some applications, they require large training datasets with accurate annotations to do so. The goal of this project is to develop an adaptable and flexible IE framework that learns from existing resources rather than relying on costly and time-consuming expert annotations, and that bridges the performance gap in real applications by addressing the extraction quality concerns and unique requirements of IE tasks in the scientific literature. Success in this project will benefit many domains by providing mechanisms for processing massive unlabeled textual datasets, speeding up literature understanding and the curation process, and promoting new scientific discoveries. The investigator will engage in departmental Broadening Participation in Computing (BPC) activities and create educational materials based on results from this project for outreach programs to local K-12 schools and communities.

This project focuses on three complementary research thrusts, each of which addresses a key obstacle to information extraction from scientific documents: 1) advancing IE models to work with heterogeneous supervisions, such as distant supervision and indirect supervision, while taking advantage of all existing resources; 2) developing new semi-open information extraction tasks to extract detailed context and uncertainties at the document level; and 3) developing a novel learn-from-mistake paradigm that integrates first-order logic rules and new annotations from domain users to refine the IE models and results. The proposed research will address a variety of problems drawn from different information extraction settings, leading to new principles, methods, and technologies for machine learning, data mining, and natural language processing. The research thrusts will be applied to extract information from STEM textbooks in order to construct concept networks for educational purposes.

Publications



Courses

COM S 661: Advanced Database Systems



Activities

In conjunction with NRT-D4 outreach