Cupral 2017

IJCNLP 2017 Workshop on Curation and Applications of Parallel and Comparable Corpora

November 27, 2017

Taipei, Taiwan

BRIEF and Updates

  • The Workshop program is ready now!
  • Invited speaker is Manoj Kumar Chinnakotla from Microsoft India
  • Paper submission due on September 5, 2017
  • Workshop will be held on November 27, 2017
  • Multi-lingual and Multi-modal Corpora released upon expression of interest
  • If you need any information, please contact Haithem Afli `haithem (dot) afli (at) adaptcentre (dot) ie`

Workshop Description

The explosive growth of multimodal data, both online and in private repositories owned by diverse institutions, has led to urgent requirements in terms of processing and management of digital content. Solutions for providing access to such data effectively depend on the connection between its different modalities. One perspective has been integrated modelling of language and vision for multimodal and multilingual documents, inspired by recent work in multimodal natural language processing.

This workshop will explore the multifarious aspects of effective document alignment in multimodal and multilingual contexts. Most businesses operating across international borders understand the value of localization: to connect with their customers, they have to be able to speak their customers' language. Websites, marketing materials, news and other high-impact elements should all be thoroughly localized, which can require a combination of computer vision (CV) and text processing in many target languages.

Clearly, techniques from Natural Language Processing (NLP) and Information Retrieval (IR) can be very useful here, and combining them with CV can potentially advance the state of the art in document alignment research. Additionally, aligned multimodal documents can be used to improve the quality of predictive analytics on multimodal data involving both text and images. For example, the images associated with news articles may be used to help improve the ranking of these articles in a search engine, or to translate the articles better into a different language.

The proposed workshop aims to provide a forum for researchers working in related fields to present their results and insights. It also aims to bring together researchers from diverse fields, such as CV, IR and NLP, who can potentially contribute to improving the quality of multimodal document alignment and its use in research and industrial data analytics tasks.

The workshop aims to provide a research platform dedicated to new methods and techniques for aligning multimodal and multilingual documents, and to exploring the use of such technology in NLP or IR. The workshop will solicit original research contributions related to this theme, which includes (but is not limited to):

· Models and Tools Development for multimodal document alignment

· Deep learning for document alignment

· Multi-lingual crowdsourcing

· Building resources for document alignment

· Analyzing the diffusion of multilingual information

· Multilingual and language-specific Information Retrieval on Social Web

· Cross-lingual document alignment using user-generated content data

· Document alignment for Big social data analysis

· Multimodal translations

· Automatic and semi-automatic methods for document alignment

· Methods to mine parallel and non-parallel corpora from the web

· Tools and criteria to evaluate the comparability of corpora in multimodal context

· Parallel vs non-parallel corpora, monolingual corpora

· Rare and minority languages, across language families

· Multi-media/multi-modal comparable corpora

Paper Submission

Workshop papers follow the same format as outlined for the IJCNLP 2017 conference. We solicit papers of five to eight pages, excluding one additional page that may be used for references.

Paper submission is due on September 5, 2017 and the workshop will be held on November 27, 2017.

Released Multi-lingual and Multi-modal Corpora

In the workshop we will release two corpora and invite our community to explore their possible applications. These two corpora will be made available to the participants of the workshop. Participants are encouraged to be innovative in how they use them, preferably in novel applications.

For the first corpus, we will be using comparable news texts and images from the Euronews website. These comparable documents have been collected on a daily basis for more than three years. We will release the English, German and French parts of the curated data. There are approximately 40,000 English articles, of which circa 36,000 have comparable counterparts in French. The images shared by comparable documents in different languages, together with their corresponding links, will also be released.

The second corpus includes 506 documents. In each document, the information extracted from Film corpus 2.0 (e.g. the speaker names and scene boundaries which could be used as contextual markers) and multilingual subtitles from are carefully processed and aligned to form an integrated record of one movie. In the parallel corpus of English and Mandarin Chinese (of all the 506 documents) there are 635,326 aligned sentence pairs with their conversational contexts being preserved.


Haithem Afli, ADAPT Centre, Dublin City University

Chao-Hong Liu, ADAPT Centre, Dublin City University

Debasis Ganguly, Dublin Research Lab, IBM Ireland

Longyue Wang, Dublin City University

Alberto Poncelas, Dublin City University

Iacer Calixto, Dublin City University