To provide a module for rapid grammar induction for both resource-rich and resource-poor domains. We will start from the existing prototypes for bottom-up and top-down grammar induction and adapt them to SDS. We will then fuse knowledge-based and corpus-based grammar creation, and provide an interface for manual correction and extension of grammars. Specific methods and tools will be devised to support the following objectives:
- Create high-quality grammar resources covering the main grammar formats commonly used in SDS.
- Reuse grammar resources that are already available.
- Integrate data compiled for ontology learning and enrichment.
- Provide an easy-to-use interface, integrated into the grammar induction platform, to facilitate rapid prototyping.
DESCRIPTION OF WORK
- Grammar Induction from the Lexical-Semantic Interface: Adapting the top-down grammar creation approach developed by UNIBI, which uses lexicalized ontologies, to SDS grammars (for resource-rich domains). Subtasks are the adaptation of the grammar creation algorithm to GF grammars and the integration of existing algorithms for generating SDS grammars from GF grammars.
- Corpus-based Grammar Induction: Starting from TSI-TUC's corpus-based grammar induction system (for resource-poor domains) we will: 1) enhance the web data harvesting module to better query for, filter and select data that is SDS related, 2) improve the grammar induction module to take advantage of bootstrap grammars if available (in addition to web data), and 3) combine parsing with statistical semantic relatedness metrics to classify/attach grammar fragments to the domain ontologies.
- Fusion of Ontology and Corpus-based Approaches: Three fusion methods will be investigated, namely: early integration (at the lexicalized ontology level), mid-level integration (top-down grammars are used to bootstrap the bottom-up method) and late integration (grammar fragments produced by both methods are combined and post-edited). A general architecture will be devised where the relative weights of the two methods can be adjusted depending on the availability of resources (resource-poor vs. resource-rich scenarios). Special care will be taken for grammar preambles/tails, where corpus-based methods perform better. Both finite-state and statistical grammars (n-grams) will be generated by this module.
- Interface for Grammar Enrichment: For the commercialization of grammar induction technology we adopt an iterative machine-aided approach, where a human is involved in the following stages: 1) (post-)editing of lexicalized ontologies, 2) selection of grammar fragments, and 3) correction of parsing errors (misclassifications of grammar fragments to concepts). A user-friendly and efficient interface will be designed to minimize post-editing effort and maximize grammar coverage.
- Paraphrasing Interface for Grammar and Prompt Enrichment: We will investigate the relevance of paraphrasing technology for suggesting alternative wording of existing grammar lexicalizations and SDS prompts.
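
To make the late integration scheme with adjustable relative weights more concrete, the following is a minimal Python sketch and not the project's implementation. It assumes that each method outputs, per ontology concept, a set of grammar fragments with a confidence score in [0, 1]. The function names, the `corpus_weight` parameter, the selection threshold and the example fragments are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def fuse_fragments(top_down, bottom_up, corpus_weight=0.5):
    """Late integration: merge scored fragments produced by both methods.

    top_down, bottom_up: dict mapping concept -> {fragment: score in [0, 1]}.
    corpus_weight: assumed relative weight of the corpus-based (bottom-up)
    method; the knowledge-based (top-down) method gets 1 - corpus_weight.
    """
    if not 0.0 <= corpus_weight <= 1.0:
        raise ValueError("corpus_weight must lie in [0, 1]")

    fused = defaultdict(dict)
    weighted_sources = (
        (1.0 - corpus_weight, top_down),
        (corpus_weight, bottom_up),
    )
    for weight, source in weighted_sources:
        for concept, fragments in source.items():
            for fragment, score in fragments.items():
                previous = fused[concept].get(fragment, 0.0)
                fused[concept][fragment] = previous + weight * score
    return {concept: dict(frags) for concept, frags in fused.items()}


def select_fragments(fused, threshold):
    """Keep fragments whose fused score reaches the (assumed) threshold."""
    return {
        concept: sorted(f for f, s in frags.items() if s >= threshold)
        for concept, frags in fused.items()
    }


# Hypothetical fragments for a single concept, for illustration only.
top_down = {"departure_city": {"from <city>": 1.0, "leaving <city>": 0.8}}
bottom_up = {"departure_city": {"from <city>": 0.5, "out of <city>": 1.0}}

# Resource-rich scenario: trust the ontology-based grammar more.
rich = select_fragments(fuse_fragments(top_down, bottom_up, 0.25), 0.5)
# Resource-poor scenario: trust the corpus-based grammar more.
poor = select_fragments(fuse_fragments(top_down, bottom_up, 0.75), 0.5)
```

In this sketch a single weight shifts the selection between the two scenarios: with a low `corpus_weight` the ontology-derived fragments dominate, and with a high one the corpus-derived fragments do. The surviving fragments would still go through the manual post-editing stage described above.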