OBJECTIVES
  1. Creating domain ontologies for SDS by combining linguistic resources and corpus-based methods,
  2. mining related web data, extracting named entities and relations from these data, and attaching the data to domain ontology concepts, and
  3. lexicalizing the SDS ontologies by adding lexical variants and grammar fragments.
Semantic relatedness metrics are an enabling technology for ontology population, enrichment and lexicalization, especially for resource-poor domains. Since the technologies are not yet mature enough for fully automated ontology creation, an iterative, machine-aided approach with a human in the loop is proposed here.

DESCRIPTION OF WORK

  • Multilingual, Scalable Semantic Relatedness Computation: Semantic similarity is the main tool for corpus-based ontology population and enrichment. In this task, we will investigate the performance of TSI-TUC's state-of-the-art corpus-based semantic relatedness computation algorithm for multi-word terms and named entities that commonly appear in SDS. Web data will be harvested for all PortDial languages (in addition to English and Greek, where data already exist) and performance will be evaluated for all languages. Scalability to semantic networks with 1M+ words and terms will be verified.
  • Domain-Specific Ontology Population and Enrichment: Starting from TSI-TUC's OntoGain system for ontology learning, we will: 1) incorporate corpus-based semantic relatedness metrics for ontology enhancement, 2) investigate the fast adaptation of general-purpose ontologies to SDS domains, 3) perform ontology population for terminal concepts, and 4) post-edit ontologies using a Protégé plug-in that will be authored specifically for SDS. Both resource-rich (adaptation of a general-purpose ontology to SDS) and resource-poor (corpus-based ontology learning starting from a small bootstrap ontology) scenarios will be investigated.
  • Named Entity Detection: Web-harvested data (see also next task) will be processed using ES state-of-the-art algorithms for: 1) identification of named entities in the provided documents, including definition of entity type and possible entity attributes, 2) identification of relationships among entities, or between entities and other concepts, and 3) tagging of documents with the extracted information so that subsequent tasks have the information needed to attach documents to domain ontologies. Named entities are relevant for ontology population and lexicalization.
  • Data Mining and Attaching Documents to Domain Ontologies: Web data will be harvested using web crawlers and queries, using ES and TSI-TUC infrastructure. The output will be used for: 1) extraction of identified entities and relations, including attributes, from the data, 2) identification of ontologies or ontology concepts related to the extracted entities, and 3) attachment of the extracted data to the ontologies, so as to produce ontological concepts annotated with entities and relations.
  • Lexicalization of Domain Ontologies: This task consists of developing algorithms for the automatic learning of ontology lexica. This involves extracting lexical entries from the mined data, discovering lexical variants using the semantic relatedness measures, extending the lexicon in parallel with ontology enrichment, and providing an interface for the manual correction of the generated lexica.
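
To illustrate how a corpus-based semantic relatedness measure could support the discovery of lexical variants described in the tasks above, the following is a minimal sketch only. It is not TSI-TUC's algorithm: it assumes a simple context-window co-occurrence model with cosine similarity, and the toy corpus and the function names (context_vectors, cosine, rank_variants) are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_variants(term, candidates, vectors):
    """Rank candidate lexical variants of `term` by decreasing relatedness."""
    empty = Counter()
    term_vector = vectors.get(term, empty)
    scored = [(c, cosine(term_vector, vectors.get(c, empty))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus standing in for web-harvested domain data.
corpus = [
    "book a flight to athens",
    "reserve a flight to london",
    "book a ticket to athens",
    "cancel the hotel in london",
]
vectors = context_vectors(corpus)
ranking = rank_variants("book", ["hotel", "reserve"], vectors)
```

In this sketch, "reserve" shares contexts with "book" and is therefore ranked above "hotel" as a candidate variant. In an iterative, human-in-the-loop setting such a ranking would only propose candidates for manual correction, and a realistic system would need far larger corpora, handling of multi-word terms, and language-specific tokenization.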