Project Overview

Project Title: Fine-Grained Semantic Markup of Descriptive Data for Knowledge Applications in Biodiversity Domains

NSF Award #: EF-0849982

Project Duration: 36 months, starting on 09/01/09

This project further develops a set of unsupervised machine learning algorithms and creates a suite of domain-independent, high-throughput software that marks up the textual descriptive data in taxonomic treatments to support various knowledge applications, including the production of character matrices and identification keys for different taxon groups. The algorithms, based on bootstrapping techniques, address the weaknesses of previously published approaches, such as the need for training examples in supervised machine learning approaches, or for hand-crafted regular expressions or grammar rules in syntactic parsing approaches. These requirements raise the bar for adopting the technology and limit its application in other ways, because they likely have to be met on a collection-by-collection basis. The unsupervised algorithms exploit a fundamental characteristic of taxonomic descriptions, namely that they are rich in concepts connected by simple syntax that is used repeatedly in different contexts. The algorithms start with a few known concepts, learn previously unknown concepts iteratively, and eventually identify characters/states for producing keys. The algorithms also play a significant role in improving coverage and literary warrant for domain ontologies: they verify the actual use of ontological concepts and relationships in domain literature and propose new ones discovered from the literature for inclusion in ontologies. The project ensures the robustness of the software by testing it on documents of different taxon groups (e.g., ants, plants, and invertebrate fossils), from different sources (e.g., journal articles, monographs, and multi-volume reference works), and of different origins (e.g., OCRed and born-digital documents).
The study of different types of documents deepens our understanding of the commonalities and differences among different communities of practice, removes oversimplified assumptions from the algorithms to ensure their robustness and flexibility, and increases awareness of homographs and synonyms across biodiversity domains to help ensure the quality of domain ontologies. The algorithms are also evaluated against algorithms developed by another informatics team, using jointly developed benchmarks and quality- and effort-based metrics. Successful completion of this project produces high-quality, reusable resources (e.g., lexicons and benchmarks) for future bioinformatics research, moves large quantities of biosystematics documents into new semantics-aware formats with minimal human intervention, speeds the creation of identification keys and domain ontologies, and enables innovative tools for biodiversity research and information management, for example, tools that relate georeferences or genome sequences to morphological characters, and intelligent quality-assurance tools for editors (e.g., tools that check parallelism in descriptions).
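
The bootstrapping idea described above (start with a few known concepts and learn unknown ones iteratively from simple, repeated syntax) can be illustrated with a toy sketch. This is not the project's actual algorithm: the function name bootstrap, the two-word clause format, and the two learning rules are assumptions made purely for illustration. The sketch assumes each clause begins with a structure term followed by a state term (e.g., "leaves ovate").

```python
def bootstrap(clauses, seed_structures):
    """Toy bootstrapping loop (illustrative assumption, not the project's method).

    Rule 1: a word that follows a known structure is learned as a state.
    Rule 2: a clause-initial word followed by a known state is learned
            as a structure.
    The loop repeats until a full pass learns nothing new.
    """
    structures = set(seed_structures)
    states = set()
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            words = clause.lower().split()
            if len(words) < 2:
                continue  # nothing to learn from a one-word clause
            head, following = words[0], words[1]
            if head in structures and following not in states:
                states.add(following)      # Rule 1
                changed = True
            elif following in states and head not in structures:
                structures.add(head)       # Rule 2
                changed = True
    return structures, states


if __name__ == "__main__":
    clauses = [
        "leaves ovate",
        "petals ovate",
        "petals white",
        "sepals white",
        "stamens numerous",
    ]
    structures, states = bootstrap(clauses, {"leaves"})
    print(sorted(structures))
    print(sorted(states))
```

Starting from the single seed "leaves", the sketch learns "ovate" and "white" as states and "petals" and "sepals" as structures. The clause "stamens numerous" shares no term with anything already known, so both of its words remain unlearned, which shows why the choice of seed concepts and the repetition of terms across contexts matter for this kind of approach.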