Text corpora

1. Topic Identification:

Khaleej-2004

Size: 4.1 MB

Number of categories (topics): 4

Watan-2004

Size: 14.4 MB

Number of categories (topics): 6

2. Machine Translation:

PADIC: Parallel Arabic DIalectal Corpus

- contains six dialects in addition to MSA.

- is in Buckwalter format.

- contains more than 6,000 sentences.

Khaleej-2004 corpus

I prepared this corpus to carry out experiments on Topic Identification for the Arabic language. It was extracted from thousands of articles downloaded from an online newspaper.

The corpus contains more than 5,000 articles, corresponding to nearly 3 million words.

Punctuation has been removed deliberately. For more information, see the following works based on the Khaleej-2004 corpus:

  • M. Abbas, K. Smaili. Comparison of Topic Identification Methods for Arabic Language, International Conference RANLP05: Recent Advances in Natural Language Processing, 21-23 September 2005, Borovets, Bulgaria. [pdf]

  • M. Abbas, K. Smaili, D. Berkani. Multi-category support vector machines for identifying Arabic topics, 10th International Conference on Intelligent Text Processing and Computational Linguistics - CICLing 2009 (2009), Mexico [pdf]

  • M. Abbas, D. Berkani. Topic Identification by Statistical Methods for Arabic Language. WSEAS Transactions on Computers, Issue 9, Volume 5, pp. 1908-1913, 2006. [pdf]

Watan-2004 corpus

The Watan-2004 corpus contains about 20,000 articles covering the following six topics (categories):

Culture, Religion, Economy, Local News, International News and Sports.

In this corpus, punctuation has been omitted intentionally to make it useful for language modeling.
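Because the text carries no punctuation, whitespace tokenization is enough to collect n-gram statistics. The following is a minimal sketch of bigram counting with a maximum-likelihood estimate. It assumes one article or sentence per line; the function names and the toy input are hypothetical and are not taken from the corpus or from the papers listed here.

```python
from collections import Counter


def count_bigrams(lines):
    """Count bigrams and their left-hand histories over whitespace-tokenized lines."""
    histories, bigrams = Counter(), Counter()
    for line in lines:
        tokens = line.split()
        # Every token except the last one on a line acts as a bigram history.
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams


def bigram_prob(histories, bigrams, prev, word):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]


# Toy input standing in for corpus lines (placeholder tokens, not real data).
toy_lines = ["a b a c", "a b"]
histories, bigrams = count_bigrams(toy_lines)
print(bigram_prob(histories, bigrams, "a", "b"))
```

A real language model would add smoothing for unseen bigrams; this sketch only shows why punctuation-free text can be tokenized directly.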

My works based on the Watan-2004 corpus:

  • M. Abbas, K. Smaili, D. Berkani. Comparing TR-Classifier and kNN by using Reduced Sizes of Vocabularies. The 3rd International Conference on Arabic Language Processing, CITALA 2009, 4-5 May 2009, Mohammadia School of Engineers, Rabat, Morocco.

  • M. Abbas, K. Smaili, D. Berkani, "Evaluation of Topic Identification Methods for Arabic Texts and their Combination by using a Corpus Extracted from the Omani Newspaper Alwatan." Arab Gulf Journal of Scientific Research 29.3-4 (2011): 183-191.

  • M. Abbas, K. Smaili, D. Berkani. (2011). Evaluation of Topic Identification Methods on Arabic Corpora. Journal of Digital Information Management Vol. 9 No. 5, pp.185-192.

  • M. Abbas, K. Smaili, D. Berkani. (2010). TR-Classifier and kNN Evaluation for Topic Identification Tasks. Special Issue on Advances in Arabic Language Processing, the International Journal on Information and Communication Technologies (IJICT), Vol 3, N 3, pp. 65-74, Serial Publications.

  • M. Abbas, K. Smaili, D. Berkani. Efficiency of TR-Classifier versus TFIDF. First International Conference on Integrated Intelligent Computing, August 5-7, 2010.

PADIC

PADIC (Parallel Arabic DIalectal Corpus) is a multi-dialectal corpus built within the framework of the National Research Project "TORJMAN", code: 24/u23/1902, led by the Scientific and Technical Research Center for the Development of Arabic Language and funded by the Algerian Ministry of Higher Education and Scientific Research.

PADIC:

- contains six dialects in addition to MSA.

- is in Buckwalter format.

- contains more than 6,000 sentences.
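Since the corpus is stored in Buckwalter transliteration, readers may want to convert it back to Arabic script. The sketch below applies the standard Buckwalter character table. It is an assumption that PADIC follows this standard table without dialect-specific extensions; the function name is hypothetical, and characters outside the table are passed through unchanged.

```python
# Standard Buckwalter-to-Arabic table, written with Unicode escapes.
# Assumption: PADIC uses this table without dialect-specific additions.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A",
    # Diacritics: tanween, short vowels, shadda, sukun, dagger alif.
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E",
    "u": "\u064F", "i": "\u0650", "~": "\u0651", "o": "\u0652",
    "`": "\u0670", "{": "\u0671",
}


def buckwalter_to_arabic(text: str) -> str:
    """Replace each Buckwalter symbol with its Arabic character.

    Characters that are not in the table (spaces, digits, and so on)
    are left unchanged.
    """
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


# "ktAb" is the Buckwalter spelling of the Arabic word for "book".
print(buckwalter_to_arabic("ktAb"))
```

If the corpus files turn out to use additional symbols, the dictionary can be extended without changing the function.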

Other Arabic corpora can be found in the Blark Content.

N.B.: This corpus is for scientific use only. Any use of it to create and release other resources or software requires the authorization of Mourad Abbas.