Abstract

The objective of this research is to evaluate the feasibility of exploiting inference between languages in NLP. We focus on inference phenomena within word formation, which we have named multilingual morphosemantic links, and on the feasibility of their exploitation in machine translation. This work aims to be both practical and theoretical. On the theoretical side, we question the grounds for this type of inference and propose a first attempt at formalising it. On the practical side, we show how this inference can be exploited to solve an important issue in machine translation: lexical incompleteness.

Any NLP application based on lexica depends heavily on the completeness of the resource. A word that is not in the lexicon cannot be processed by the system, which can affect, to a greater or lesser degree, the quality of the system's output. Depending on the application, many different solutions have been investigated to compensate for lexical incompleteness and to guess the unknown. In a machine translation system, where a transfer between two languages is implied, guessing the unknown is very complex because it involves dealing with the unknown at both the analysis and the generation steps of the translation process.

Unknown words in machine translation systems can be of different kinds (proper names, erroneous words, words arising from lexical creativity), but in this research we concentrate on the last kind. These words constitute a dynamic class of items: some will eventually be added to the lexicon; others will exist only at the time at which they are produced and perceived. By exploiting and formalising multilingual morphosemantic links in machine translation, we aim to propose a translation for an unknown word without having to add it to the lexicon.

For practical reasons, we concentrate on only one construction process (prefixation) and on two languages (Italian and French, deliberately chosen because they are «related» and consequently have fewer divergences). Nonetheless, the proposed methods and solutions are applicable to other neological formation processes and to other language pairs.

The first part of this work presents various studies of lexical incompleteness in different machine translation systems and other NLP tools. These studies show that the phenomenon of lexical incompleteness is constant whatever the system evaluated, and that the solution to this problem cannot simply be to «feed the lexicon» with unknown words. Moreover, a qualitative analysis of the unknown words highlights that a large number of them are neologisms constructed by regular processes. These constructed neologisms are also strongly influenced by contact between languages, which leads us to posit a parallelism in the creation of neologisms across languages that could be exploited in machine translation.

The second part precisely defines the notion of a multilingual morphosemantic link, which helps us represent construction similarities between languages. This link is defined according to a double reproducibility: within one language and between two languages. To be exploited in machine translation, these links are formalised through bilingual Lexeme Formation Rules (LFR), adopting a lexematic approach to morphology that provides ideal descriptive means to deal with neologisms. Building these LFR requires an in-depth study of the morphological systems of the two languages and a contrastive study of the construction processes. This contrastive approach is based on the use of a tertium comparationis, a theoretical platform onto which we can «project» the elements to be compared. The «projection» gives the translational material needed to implement bilingual LFR and reveals, in a refinement step, structural divergences that have to be taken into account.

The third part of this work deals with implementing the LFR in a machine translation context. To do so, we build a prototype system that automatically translates prefixed neologisms. This system allows us to experiment with every step of the automated translation process. We show that the main challenge lies in the analysis of the unknown words: this is where most of the work on specific constraints has to be done to ensure optimal output quality. The generation stage mainly requires an adequate bilingual lexicon, but it also raises some issues specific to prefixation, namely alternating prefixes (as in multidimensionnel or pluridimensionnel) and alternating bases (anticancer or anticancéreux).
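The analysis, transfer and generation steps described above can be illustrated with a minimal sketch. It is not the prototype itself: the class BilingualLFR, the functions analyse and translate_neologism, the two toy rules and the two-entry bilingual lexicon are all hypothetical names and data assumed for illustration, and the only analysis constraint shown is that the base must be a known lexeme.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class BilingualLFR:
    """Hypothetical bilingual Lexeme Formation Rule for a single prefix."""
    source_prefix: str  # Italian prefix
    target_prefix: str  # French prefix


# Toy resources, assumed for illustration only.
RULES: List[BilingualLFR] = [
    BilingualLFR("anti", "anti"),
    BilingualLFR("multi", "multi"),
]
BILINGUAL_LEXICON: Dict[str, str] = {
    "cancro": "cancer",
    "dimensionale": "dimensionnel",
}


def analyse(word: str) -> Optional[Tuple[BilingualLFR, str]]:
    """Analysis step: split an unknown word into (rule, base).

    The only constraint applied here is that the remaining base must
    already be in the lexicon; a real system would need many more.
    """
    for rule in RULES:
        if word.startswith(rule.source_prefix):
            base = word[len(rule.source_prefix):]
            if base in BILINGUAL_LEXICON:
                return rule, base
    return None


def translate_neologism(word: str) -> Optional[str]:
    """Transfer and generation: map prefix and base, then recombine."""
    analysis = analyse(word)
    if analysis is None:
        return None  # not analysable as a prefixed neologism
    rule, base = analysis
    return rule.target_prefix + BILINGUAL_LEXICON[base]


if __name__ == "__main__":
    for unknown in ["anticancro", "multidimensionale", "tavolo"]:
        print(unknown, "->", translate_neologism(unknown))
```

In this sketch each rule yields a single target form; the alternating prefixes and alternating bases mentioned above would require a rule or a lexicon entry to return several candidates, which is deliberately left out.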

In the fourth and final part we evaluate the entire approach. The first step consists in evaluating the quality of the translated neologisms and their influence on the quality of the entire sentence once the neologism is translated. The second step raises the question of the feasibility and portability of this approach, in order to highlight the main conditions necessary to make such a system work. We show that strong theoretical grounds, with linguistic principles and appropriate constraints and resources, are the main prerequisites for taking advantage of multilingual morphosemantic links to deal with unknown words in machine translation systems.