Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test collection

Distribution 1.0

30 September 2009

Massih R. Amini (Université Joseph Fourier, Grenoble, France)

and Cyril Goutte (National Research Council Canada)

Overview

This test collection contains feature characteristics of documents originally written in five different languages (English, French, German, Spanish and Italian), and their translations, over a common set of 6 categories. This collection can be used for multilingual categorization, crosslingual categorization and multiview learning (one view=one language) research. Documents have been translated and preprocessed as explained below, and are made available as feature characteristics in a "bag of words" format.

Corpus Acquisition & Processing

Documents from 6 large Reuters categories (CCAT, C15, ECAT, E21, GCAT and M11) were extracted from RCV1 (for English), and RCV2 (for French, German, Italian and Spanish). We sampled up to 5000 documents for each category in each language. Documents belonging to more than one of the 6 categories were assigned to the smallest category. This resulted in 12-30K documents per language, and 13-21K documents per class.
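The single-label assignment rule can be illustrated with a minimal sketch. It assumes that "smallest" means the category with the fewest member documents, counted over all memberships, and that ties are broken alphabetically; neither detail is stated here, and the function name assign_single_label is hypothetical.

```python
from collections import Counter

def assign_single_label(doc_labels):
    """Map each document's set of categories to one category: the smallest one."""
    # Assumption: category size = number of documents carrying that label.
    sizes = Counter(label for labels in doc_labels for label in labels)
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    return [min(labels, key=lambda c: (sizes[c], c)) for labels in doc_labels]

# Toy example: CCAT has 3 members, C15 and GCAT have 2 each.
docs = [{"CCAT", "C15"}, {"CCAT"}, {"CCAT", "GCAT"}, {"GCAT"}, {"C15"}]
print(assign_single_label(docs))
```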

In order to produce multilingual versions of each document, each original document was translated into the other 4 languages using a statistical machine translation system. We used the Portage system described by Ueffing et al. (2007), trained on the Europarl corpus for the 20 language pairs required here.

Each of the resulting 558,700 document versions (111,740 documents in 5 languages) was preprocessed and indexed using a standard preprocessing chain including removal of stopwords and low-frequency words. Documents were then represented as a bag of words using a TFIDF-based weighting scheme.

In studies carried out using this collection, one language is typically treated as one view of the document. We therefore have 5 views of each of the 111,740 documents extracted from the Reuters corpus. In Amini et al. (2009), 20% of the documents were reserved as a test set, and the results were averaged over 10 random choices of labeled training examples.
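A protocol of this kind could be set up roughly as follows. This is only a sketch under stated assumptions: it holds the test set fixed across the 10 repetitions and draws the labeled examples uniformly at random from the remaining documents, which may differ from the exact procedure in Amini et al. (2009). The function make_splits and the parameter n_labeled are hypothetical.

```python
import random

def make_splits(n_docs, n_labeled, test_fraction=0.2, n_repeats=10, seed=0):
    """Return (test_indices, list of labeled-training index sets)."""
    rng = random.Random(seed)
    indices = list(range(n_docs))
    rng.shuffle(indices)
    # Reserve a fixed share of the documents for testing (assumed fixed across repeats).
    n_test = int(round(test_fraction * n_docs))
    test_idx = sorted(indices[:n_test])
    pool = indices[n_test:]
    # Draw one random set of labeled training examples per repetition.
    labeled_sets = [sorted(rng.sample(pool, n_labeled)) for _ in range(n_repeats)]
    return test_idx, labeled_sets

test_idx, labeled_sets = make_splits(n_docs=100, n_labeled=10)
print(len(test_idx), len(labeled_sets))
```

Since every view lists the documents in the same order, one set of indices can be applied to all 5 views of the collection.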

Language Distribution

Class Distribution

Download & Copyright

The collection is available through UCI as a bzipped archive (160 MB compressed, 464 MB uncompressed). This test collection is publicly available for research purposes only. The original textual data belongs to Reuters and may be obtained from NIST.

If you publish results based on this processed dataset, please acknowledge its use, by referring to:

M.-R. Amini, N. Usunier, C. Goutte. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. Advances in Neural Information Processing Systems 22 (NIPS 2009), 2009.

Content

Uncompressing the archive creates the directory MultiLingualReutersCollection/, which contains 5 subdirectories, EN, FR, GR, IT and SP, corresponding to the 5 languages. Each subdirectory contains 5 files, each holding the indexed documents written in or translated into that language. For example, EN contains the files:

    • Index_EN-EN : Original English documents,

    • Index_FR-EN : French documents translated to English,

    • Index_GR-EN : German documents translated to English,

    • Index_IT-EN : Italian documents translated to English,

    • Index_SP-EN : Spanish documents translated to English.

The same naming scheme applies to the 4 other languages.
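The naming scheme above can be captured in a small helper. This is a sketch only; the function name index_path and its argument names are hypothetical, and the root directory is whatever location the archive was uncompressed to.

```python
from pathlib import Path

# Language codes as used by the subdirectories of the collection.
LANGS = ("EN", "FR", "GR", "IT", "SP")

def index_path(root, source_lang, view_lang):
    """Path of the file holding documents originally written in source_lang,
    represented in view_lang (original text if both codes are equal)."""
    if source_lang not in LANGS or view_lang not in LANGS:
        raise ValueError("unknown language code")
    return Path(root) / view_lang / f"Index_{source_lang}-{view_lang}"

# The 5 views of the documents originally written in French.
for view in LANGS:
    print(index_path("MultiLingualReutersCollection", "FR", view).as_posix())
```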

Each file contains one indexed document per line, in a format similar to SVM_light. Each line is of the form:

    cat feature:value feature:value ...

where cat is the category label, i.e. one of C15, CCAT, E21, ECAT, GCAT or M11, and each feature:value is a (feature, value) pair. The pairs are listed in ascending order of feature index.
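A line in this format could be parsed along the following lines. The sketch assumes that feature indices are integers and values are decimal numbers, as in SVM_light; the function name parse_line is hypothetical.

```python
def parse_line(line):
    """Split one line into (category label, {feature index: value})."""
    parts = line.split()
    if not parts:
        raise ValueError("empty line")
    label = parts[0]
    features = {}
    for token in parts[1:]:
        # Each token is a feature:value pair.
        index, value = token.split(":", 1)
        # Assumption: integer feature indices and floating-point weights.
        features[int(index)] = float(value)
    return label, features

label, features = parse_line("C15 3:0.5 17:0.25\n")
print(label, features)
```

A dictionary keeps the representation sparse, which suits bag-of-words vectors where most features are absent from any given document.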

The order of documents is maintained across corresponding files. For example, FR/Index_EN-FR and EN/Index_EN-EN have the same number of documents (and therefore the same number of lines), in the same order.
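Because corresponding files are aligned line by line, two views of the same documents can be read in parallel. The sketch below takes any two iterables of lines (open file objects work) and checks the alignment as it goes. It assumes that both views of a document carry the same category label, which is expected but not stated explicitly here; the names iter_aligned and split_label are hypothetical.

```python
from itertools import zip_longest

def split_label(line):
    """Return (label, remaining feature text) for one line."""
    parts = line.split(None, 1)
    if not parts:
        raise ValueError("empty line")
    rest = parts[1].strip() if len(parts) > 1 else ""
    return parts[0], rest

def iter_aligned(lines_a, lines_b):
    """Yield (label, features_a, features_b) for each document in two aligned views."""
    for line_no, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a is None or b is None:
            raise ValueError(f"views differ in length at line {line_no}")
        label_a, feats_a = split_label(a)
        label_b, feats_b = split_label(b)
        # Assumption: the same document has the same label in every view.
        if label_a != label_b:
            raise ValueError(f"label mismatch at line {line_no}")
        yield label_a, feats_a, feats_b

view_en = ["C15 1:0.5\n", "GCAT 2:1.0\n"]
view_fr = ["C15 7:0.3\n", "GCAT 9:0.2\n"]
pairs = list(iter_aligned(view_en, view_fr))
print(pairs)
```

With the real collection, the two arguments would be open handles on, for example, EN/Index_EN-EN and FR/Index_EN-FR.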

Acknowledgements

We thank Reuters for making the RCV1/RCV2 data available and granting permission to distribute processed versions of it. More information is available in the README file.

Bibliography

@inproceedings{AUG09,
author = "Massih-Reza Amini and Nicolas Usunier and Cyril Goutte",
title = "Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization",
booktitle = "Advances in Neural Information Processing Systems 22 (NIPS 2009)",
url = "http://books.nips.cc/papers/files/nips22/NIPS2009_0688.pdf",
pages = "28--36",
year = "2009"
}

@inproceedings{USLJ07,
author = "Nicola Ueffing and Michel Simard and Samuel Larkin and J.~Howard Johnson",
title = "{NRC}'s {PORTAGE} system for {WMT} 2007",
booktitle = "In ACL-2007 Second Workshop on SMT",
url = "http://www.statmt.org/wmt07/pdf/WMT24.pdf",
pages = "185--188",
year = "2007"
}