Arabic Datasets

The Echorouk collection contains 11,313 documents from Echorouk newspaper articles covering the 2008-2009 period. It is labeled according to 8 categories. The full corpus (Ech-11k) is used for validation experiments, and a subset (Ech-4000) of 4,000 documents is used for preliminary evaluations.

The Reuters collection (Rtr-41k) contains 41,251 Arabic documents covering the 2007-2009 period. It is labeled according to 6 categories. A subset (Rtr-5251) of 5,251 documents is used for preliminary evaluations.

The Xinhua collection contains 36,696 Arabic documents covering the 2008-2009 period. It is labeled according to 8 categories. A subset (Xnh-4500) of 4,500 documents is used for preliminary evaluations. Table 8 describes the collected datasets and their distribution over the published categories.

Titles and Categories

Each article is saved in a separate file: the first line holds its title, and the file name extension gives the category label.

Description of three datasets relating to Echorouk, Reuters and Xinhua Web-articles

Distribution of the three datasets over categories.
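The file layout described above (one article per file, title on the first line, category in the file name extension) could be read with a loader along the following lines. This is only a minimal sketch, not the authors' tooling: the names `Article`, `load_corpus` and `category_counts` are hypothetical, and the UTF-8 encoding is an assumption that may need changing if the files use another Arabic encoding.

```python
from collections import Counter
from pathlib import Path
from typing import NamedTuple


class Article(NamedTuple):
    title: str      # first line of the file
    body: str       # everything after the first line
    category: str   # file name extension without the leading dot


def load_corpus(root, encoding="utf-8"):
    """Read every file under `root` that has an extension.

    Assumption: files are UTF-8 encoded; pass another `encoding`
    (hypothetical parameter) if the downloaded archive differs.
    """
    articles = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and files without an extension (no label).
        if not path.is_file() or not path.suffix:
            continue
        text = path.read_text(encoding=encoding, errors="replace")
        # Split the title (first line) from the rest of the article.
        title, _, body = text.partition("\n")
        articles.append(
            Article(title.strip(), body.strip(), path.suffix.lstrip("."))
        )
    return articles


def category_counts(articles):
    """Count how many articles fall under each category label."""
    return Counter(article.category for article in articles)
```

Calling `category_counts(load_corpus("Ech-4000"))`, where `Ech-4000` is a hypothetical path to the unpacked subset, would give a per-category document count that can be compared against the published distribution.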

Download

The Ech-4000 dataset is available here.

The Rtr-5251 dataset is available here.

The Xnh-4500 dataset is available here.

Reference

These resources can be used for research purposes only.

The results on Arabic topic modeling and text categorization are described in this paper.

For more information, contact me at: brahmi@univ-mosta.dz

If you use these corpora, please cite the following paper:

Brahmi, A., Ech-Cherif, A., & Benyettou, A. (2012). Arabic texts analysis for topic modeling evaluation. Information Retrieval, Vol. 15, No. 1, pp. 33-53. DOI: http://dx.doi.org/10.1007/s10791-011-9171-y.
