|
I prepared this corpus to run experiments on topic identification for the Arabic language. It was extracted from thousands of articles downloaded from an online newspaper.
The corpus contains more than 5000 articles, corresponding to nearly 3 million words.
Punctuation has been removed on purpose. For more information, see the works based on the Khaleej corpus:
Download: The two corpora can be downloaded from the link:
Watan-2004 corpus
The Watan-2004 corpus contains about 20000 articles covering the following six topics (categories): Culture, Religion, Economy, Local News, International News, and Sports. In this corpus, punctuation has been omitted intentionally to make it useful for language modeling. My works based on the Watan corpus can be found under the Publications link.
N.B.: This corpus is for scientific use only. |