Download Labelled Datasets
Available Datasets
- Content-based Categorized Dataset (996 labelled Arabic Web pages -- available below)
- Dialect-labelled Dataset (will be available soon)
Content-based Categorized Dataset
To estimate the proportion of different Web page types in ArabicWeb16, we sampled 996 Web pages from the dataset and labelled them using CrowdFlower (see our SIGIR 2016 paper for details).
This dataset can be used for text classification research or related problems.
- Download the labels file (one tab-separated record per line, formatted as <ARABICWEB16-DOC-ID>\t<Label>\t<URL>).
- Download the labelled pages (14 MB) in one WARC file (see the Getting Started page for how to read it).
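As a rough illustration of the labels format described above, the following Python sketch parses tab-separated lines into records. The names `LabelRecord` and `parse_labels` are hypothetical helpers, not part of any ArabicWeb16 tooling, and the sketch assumes the file is UTF-8 text with exactly three tab-separated fields per line and no header row.

```python
from typing import Iterable, List, NamedTuple


class LabelRecord(NamedTuple):
    """One line of the labels file (hypothetical helper type)."""
    doc_id: str
    label: str
    url: str


def parse_labels(lines: Iterable[str]) -> List[LabelRecord]:
    """Parse lines of the form <DOC-ID>\\t<Label>\\t<URL> into records.

    Assumes no header row; blank lines are skipped.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split into at most three fields: document ID, label, URL.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated fields, "
                f"got {len(parts)}"
            )
        records.append(LabelRecord(*parts))
    return records
```

To read a downloaded file, pass an open file handle, for example `parse_labels(open("labels.tsv", encoding="utf-8"))`, where `labels.tsv` is a placeholder for whatever name the labels file was saved under.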
The categories of the Web pages are:
- Informational: Web pages whose main purpose is to provide information (e.g., Wikipedia). Information can vary from scientific articles to event schedules.
- Discussion & Opinion: Web pages with discussions, opinions, interviews, etc., often on social platforms.
- News and Media: Web pages that provide news and articles on a variety of topics from around the world.
- Online Services: Web pages for online applications, or platforms for payment and shopping. These Web pages may list services or products to buy or use, user-guides, etc.
- Organizational: Institutional Web pages describing owners’ interests, activities, news or services, etc.
- Entertainment: Web pages whose main purpose is to entertain users (e.g., games, movies).
- Other: Web pages not fitting any of the above types.
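Because the sample was drawn to estimate the proportion of page types, one natural use of the labels is to compute the share of each category. The sketch below does this with the standard library; `category_proportions` is a hypothetical helper, and it assumes the label strings in the file match the category names listed above (the exact spelling used in the file is not confirmed here).

```python
from collections import Counter
from typing import Dict, Iterable


def category_proportions(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of pages assigned to each category label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No labels supplied, so there is nothing to estimate.
        return {}
    return {label: count / total for label, count in counts.items()}
```

For example, given a list of label strings read from the labels file, `category_proportions(labels)` returns a dictionary mapping each category to its share of the sample.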