We present a Foreign Language Search Assistant that makes use of noun phrases as a fundamental unit for document translation and query formulation and refinement. The system:
Supports the foreign-language document selection task by providing a cross-language indicative summary based on noun phrase translations.
Supports query formulation and refinement using the information found in the cross-language document summaries.
To translate both queries and documents, a bilingual dictionary of noun phrases is built in advance, using an algorithm that aligns phrases between two languages using only information about the component lemmas, the possible translations given in a bilingual dictionary, and the frequency of the phrases in two comparable corpora.
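The alignment idea can be illustrated with a minimal sketch. It is not the algorithm evaluated here: it assumes that a phrase is represented as a tuple of content lemmas, that a candidate translation must use one dictionary translation per source lemma in any order, and that the most frequent candidate in the target corpus wins. The names `align_phrase`, `lemma_dict`, and `target_freq` are hypothetical.

```python
from itertools import permutations, product


def align_phrase(source_lemmas, lemma_dict, target_freq):
    """Return the most frequent target phrase built from dictionary
    translations of the source lemmas, or None if no candidate is attested."""
    # One list of candidate translations per source lemma (assumed input).
    options = [lemma_dict.get(lemma, []) for lemma in source_lemmas]
    best, best_count = None, 0
    for choice in product(*options):
        # Word order may differ between languages, so try every ordering.
        for candidate in permutations(choice):
            count = target_freq.get(candidate, 0)
            if count > best_count:
                best, best_count = candidate, count
    return best


# Toy data, invented for illustration only.
lemma_dict = {
    "sistema": ["system", "scheme"],
    "información": ["information", "data"],
}
target_freq = {
    ("information", "system"): 120,
    ("data", "system"): 15,
    ("system", "information"): 2,
}

aligned = align_phrase(("sistema", "información"), lemma_dict, target_freq)
# aligned == ("information", "system")
```

Enumerating every ordering is only practical because the phrases considered have two or three lemmas.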
From a test set of approximately 21 million phrases of two and three lemmas in Spanish and English, our algorithm aligns 3.9 million, with a precision of 80% for phrases with three lemmas and 74% for phrases with two lemmas. Non-aligned noun phrases are translated using an algorithm that iteratively searches for aligned maximal sub-phrases, and uses alignment information to obtain optimal translations for the terms that remain untranslated.
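The treatment of non-aligned phrases can be sketched in the same spirit. The version below is a simplification under stated assumptions: it performs a left-to-right greedy search for the longest aligned sub-phrase, concatenates the pieces in source order without reordering them, and falls back to a single-lemma table for whatever remains. The names `translate_phrase`, `aligned`, and `fallback` are hypothetical, and the fallback table stands in for the alignment-derived term translations mentioned above.

```python
def translate_phrase(lemmas, aligned, fallback):
    """Translate a phrase by greedy longest-match over aligned sub-phrases."""
    result, i = [], 0
    while i < len(lemmas):
        # Try the longest sub-phrase starting at position i first.
        for j in range(len(lemmas), i, -1):
            sub = tuple(lemmas[i:j])
            if sub in aligned:
                result.extend(aligned[sub])
                i = j
                break
        else:
            # No aligned sub-phrase starts here: translate the single lemma,
            # or keep it unchanged when no translation is known.
            result.append(fallback.get(lemmas[i], lemmas[i]))
            i += 1
    return result


# Toy data, invented for illustration only.
aligned = {("sistema", "información"): ("information", "system")}
fallback = {"geográfico": "geographic"}

translation = translate_phrase(
    ["sistema", "información", "geográfico"], aligned, fallback
)
# translation == ["information", "system", "geographic"]
```

Because the pieces are concatenated in source order, the output is a gloss rather than fluent English; a fuller implementation would also need to reorder the translated pieces.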
The evaluation of the different aspects of the process has been carried out in the framework of \texttt{iCLEF} (Interactive Cross-Language Evaluation Forum), where the approach is compared with two reference systems:
Our cross-language summaries perform 25% better, according to the official iCLEF measure, than the translations provided by Systran Professional 3.0. Users judge documents faster with summaries, at similar precision rates.
Phrase-based query formulation and refinement performs 64% better than a standard approach that assists interactive query term translation. Users formulate queries faster and better, and perform more interactions, with the phrase-based system than with the assisted-translation system. Besides the quantitative measures, both the questionnaires filled in by searchers and the observational study of search sessions qualitatively confirm the benefits of our approach.
Our results challenge two implicit assumptions in most Cross-Language Information Retrieval research: first, that once documents in the target language are found, Machine Translation is the optimal way of informing the user about their contents; and second, that in an interactive setting the optimal way of formulating and refining the query is to help the user choose appropriate translations for the query terms.