Corpus of Arabic Link Spam

This is the first Arabic link spam corpus. It was collected by Heider A. Wahsheh as part of research on Arabic Web spam detection, and was used in Wahsheh H. A., Al-Kabi M. N., and Alsmadi I. (2012), Evaluating Arabic Spam Classifiers Using Link Analysis, The 3rd International Conference on Information and Communication Systems (ICICS 2012), ACM, Irbid, Jordan, April 3-5, 2012. It consists of around 3,000 Arabic link spam Web pages collected between April 2011 and September 2011.

The Web Link Validator tool was used to collect the link data. It is a powerful, comprehensive site management and link checking tool that helps webmasters automate website testing. The software performs a thorough analysis of website pages and checks for broken links, HTML coding errors, slow-loading pages, and outdated pages. We used this tool with the option that retrieves links from the specified page only.

Web pages are classified using a set of features extracted from URLs to distinguish between spam and non-spam. The features used can be summarized as follows:

1. The number of external links within the Web page under consideration.

2. The number of internal links within the Web page under consideration.

3. The total number of links (the internal and external) within the Web page under consideration.

4. The total number of good (working) links, and the number of internal and external good links within the Web page under consideration.

5. The total number of broken links, and the number of internal and external broken links within the Web page under consideration.

6. The total number of redirected links and the number of internal and external redirected links within the Web page under consideration.

7. The percentage of each link type (internal, external, good, broken, and redirected) within the Web page under consideration.
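
The seven features listed above can be illustrated with a minimal Python sketch. It is not the tool or code used to build the corpus. The function name `link_features`, the input format (a list of `(url, status)` pairs), and the status labels `"good"`, `"broken"`, and `"redirected"` are assumptions made for this example. A link is treated as internal when its host matches the host of the page, or when it is a relative link.

```python
from collections import Counter
from urllib.parse import urlparse

STATUSES = ("good", "broken", "redirected")  # assumed status labels
SCOPES = ("internal", "external")


def link_features(page_url, links):
    """Compute link-based features for one page.

    page_url: URL of the page under consideration.
    links: iterable of (url, status) pairs, status being one of STATUSES.
    """
    page_host = urlparse(page_url).netloc.lower()
    counts = Counter()
    for url, status in links:
        host = urlparse(url).netloc.lower()
        # Relative links have an empty host and are counted as internal.
        scope = "internal" if host in ("", page_host) else "external"
        counts[scope] += 1               # features 1 and 2
        counts[status] += 1              # totals for features 4 to 6
        counts[f"{scope}_{status}"] += 1  # per-scope counts for features 4 to 6

    total = counts["internal"] + counts["external"]  # feature 3
    features = {"total": total}
    for scope in SCOPES:
        features[scope] = counts[scope]
    for status in STATUSES:
        features[status] = counts[status]
        for scope in SCOPES:
            features[f"{scope}_{status}"] = counts[f"{scope}_{status}"]
    # Feature 7: percentage of each link type among all links on the page.
    for key in SCOPES + STATUSES:
        features[f"pct_{key}"] = 100.0 * counts[key] / total if total else 0.0
    return features


if __name__ == "__main__":
    example_links = [
        ("http://example.com/a", "good"),
        ("/b", "broken"),
        ("http://other.org/", "redirected"),
        ("http://other.org/x", "good"),
    ]
    print(link_features("http://example.com/index.html", example_links))
```

Each page then yields one feature vector that can be passed to a spam classifier. Pages with no links receive zero percentages rather than raising a division error.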