ArabicWeb16 Dataset

This website describes the ArabicWeb16 dataset, the largest publicly available Arabic Web dataset (150M pages)!

The ArabicWeb16 dataset is a public Web crawl of 150,211,934 Arabic Web pages with high coverage of dialectal Arabic as well as Modern Standard Arabic (MSA). We expect ArabicWeb16 to support various research areas, such as ad-hoc search, question answering, filtering, cross-dialect search, dialect detection, entity search, blog search, and spam detection, among others.

See our SIGIR 2016 paper, which fully describes the dataset:

Reem Suwaileh, Mucahid Kutlu, Nihal Fathima, Tamer Elsayed, and Matthew Lease. ArabicWeb16: A New Crawl for Today’s Arabic Web. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’16), pp. 673-676, Pisa, Italy, July 2016.

To get access to the dataset, check the Download ArabicWeb16 page.

To download labelled datasets related to ArabicWeb16, check the Download Labelled Datasets page.

To start processing the dataset, check the Getting Started page.

To interact with the dataset (without downloading it), check the Online Services page.