Arabic Corpora

I prepared the Khaleej corpus for running experiments on topic identification for the Arabic language. It was extracted from thousands of articles downloaded from an online newspaper.
The corpus contains more than 5000 articles, corresponding to nearly 3 million words.
Punctuation has been removed on purpose. For more information, see the works based on the Khaleej corpus:
  • M. Abbas, K. Smaili. Comparison of Topic Identification Methods for Arabic Language. International Conference RANLP05: Recent Advances in Natural Language Processing, 21-23 September 2005, Borovets, Bulgaria.
  • M. Abbas, D. Berkani. Topic Identification by Statistical Methods for Arabic Language. WSEAS Transactions on Computers, Issue 9, Volume 5, pp. 1908-1913, 2006.
  
Download: Both corpora described on this page (Khaleej and Watan-2004) can be downloaded from:

  http://sourceforge.net/projects/arabiccorpus/files/

 
 Topic                  Corpus size (number of documents)
 International News     953
 Local News             2398
 Economy                909
 Sports                 1430
 Total                  5690

Watan-2004 corpus

The Watan-2004 corpus contains about 20000 articles covering the following six topics (categories):
Culture, Religion, Economy, Local News, International News and Sports.
In this corpus, punctuation has been omitted intentionally to make it useful for language modeling. My works based on the Watan corpus can be found via the Publications link.


 Topic                  Corpus size (number of documents)
 Culture                2782
 Religion               3860
 Economy                3468
 Local News             3596
 International News     2035
 Sports                 4550
 Total                  20291
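
The six topics are not equally represented, which can matter when setting up a topic identification experiment (for example, when choosing class priors or a baseline). The following is only a minimal sketch in Python that computes each topic's share of the corpus from the document counts listed in the table. The names WATAN_2004_COUNTS and topic_shares are illustrative assumptions and are not part of the corpus distribution.

```python
# Document counts per topic, copied from the Watan-2004 table.
# The dictionary name and the function below are hypothetical helpers,
# not something shipped with the corpus.
WATAN_2004_COUNTS = {
    "Culture": 2782,
    "Religion": 3860,
    "Economy": 3468,
    "Local News": 3596,
    "International News": 2035,
    "Sports": 4550,
}


def topic_shares(counts):
    """Return each topic's fraction of the total number of documents."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


if __name__ == "__main__":
    shares = topic_shares(WATAN_2004_COUNTS)
    # Print topics from the largest share to the smallest.
    for topic, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{topic:<20} {share:6.1%}")
```

The same function can be applied to the counts of the Khaleej corpus by passing a dictionary built from its table.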


N.B.: This corpus is for scientific use only.