Work Plan

The PortDial work plan covers four main areas:
  1. Ontology population, enrichment and lexicalization: The main objective here is to create domain ontologies for spoken dialogue systems (SDS) by combining linguistic resources and corpus-based methods. For resource-rich domains, we simply select a subset of the existing ontology (shown on the right). For resource-poor domains (shown on top), the ontology is generated in a machine-aided fashion using web corpora and semantic similarity metrics; two steps are shown: ontology population and enrichment. In addition, we mine for related web data, extract named entities and relations from these data, and attach the data to domain ontology concepts.
  2. Grammar induction for domain porting combining knowledge-based and data-driven approaches: We focus on grammar induction for both resource-rich and resource-poor languages/domains. A corpus-based bottom-up approach is implemented for resource-poor domains: starting from a domain ontology (input from WP2), the ontology is mildly lexicalized (i.e., 2-3 examples are manually provided for each concept), queries spanning multiple levels of the ontology are generated from these lexicalizations, web data is harvested and filtered, and finally agglomerative clustering is used for grammar induction. As a last step, grammar fragments are classified and attached to the ontology. For the resource-rich scenario a top-down approach is used: the ontology is fully lexicalized (also using web data attached to the ontology) and the grammar is generated directly from the lexicalized ontology. The outputs of the two approaches are combined (late integration). A human annotator corrects attachment errors and selects the "best" grammar fragments.
  3. Grammar induction for language porting using machine translation: Here the technologies and interfaces developed for porting SDS resources are combined with machine translation for the language porting scenario. The top-down approach uses a multilingual lexicalized ontology for grammar induction. The bottom-up approach uses machine translation to mildly lexicalize the ontology (and then runs bottom-up grammar induction), to harvest appropriate parallel data, and/or to translate the grammars directly. The outputs of these methods (top-down, bottom-up, direct translation) will be combined. An interface for selection and post-editing is also planned here.
  4. Integration of the technology and data into a speech services prototyping platform: Here the ontology evolution and grammar induction modules are integrated into the platform. In addition, the linguistic resources are cleaned up and packaged. Last but not least, the performance of the platform and data is evaluated via prototype service creation. The data and the platform are the main outputs that will be commercially exploited by the SME partners.
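
To make the clustering step of the bottom-up approach in area 2 more concrete, the following is a minimal sketch, not the PortDial implementation. It assumes that harvesting and filtering have already produced a small set of sentences, and it replaces the project's semantic similarity metrics with a simple stand-in: cosine similarity over left/right neighbour counts. The function names (`context_vectors`, `cosine`, `agglomerate`), the average-link criterion, the threshold value and the toy corpus are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each token to counts of its left/right neighbours (assumed feature set)."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            ctx = vectors.setdefault(token, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    side = "L" if j < i else "R"
                    ctx[(side, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def agglomerate(vectors, threshold=0.5):
    """Average-link agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the best
    similarity falls below `threshold` (an assumed stopping rule).
    """
    clusters = [[word] for word in sorted(vectors)]

    def similarity(c1, c2):
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (i, j) = max(
            (
                (similarity(clusters[i], clusters[j]), (i, j))
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda item: item[0],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical stand-in for harvested and filtered web data.
    corpus = [
        "fly to boston",
        "fly to denver",
        "fly from boston",
        "fly from denver",
    ]
    for cluster in agglomerate(context_vectors(corpus)):
        print(cluster)
```

On this toy corpus, words that occur in the same contexts (the two city names, the two prepositions) end up in the same cluster, and each cluster can be read as a candidate grammar fragment to be attached to an ontology concept. A real system would need richer features, much more data, and the human correction step described above.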