Extended Arabic Web Spam 2011 Dataset

Web spamming can be defined as the set of actions that mislead search engines into ranking some pages higher than they deserve. This degrades the quality of information on the Web, puts users at risk of exploitation by Web spammers, and damages the reputation of search engines by weakening the trust of their users.

The first Arabic Web Spam corpus was built in: Wahsheh, H. A., and Al-Kabi, M. N. (2011). Detecting Arabic Web Spam. The 5th International Conference on Information Technology, ICIT'11, May 11 – 13, 2011.

This corpus of Arabic Spam Webpages is based on that previous corpus and extends both the number of Arabic Spam pages and their content-based features.

The Arabic Web Spam Corpus was collected by Heider A. Wahsheh, as part of research on Arabic Web spam detection, during the period from April 2011 to August 2011. It is considered the first publicly available Arabic Web Spam dataset, and it provides 11 content-based features extracted for 10,000 Arabic Spam Webpages.

Please cite our paper if you use the Web Spam 2011 datasets in your publication:

Wahsheh H., Abu Doush I., Al-Kabi M., Alsmadi I. and Al-Shawakfa E. (2012), Using Machine Learning Algorithms to Detect Content-based Arabic Web Spam, International Journal of Information Assurance and Security (JIAS), 7 (1): 14-24.