Download Labelled Datasets

Available Datasets

  • Content-based Categorized Dataset (996 labelled Arabic Web pages -- available below)
  • Dialect-labelled Dataset (will be available soon)

Content-based Categorized Dataset

To estimate the proportion of different Web page types in ArabicWeb16, we sampled 996 Web pages from the dataset and had them labelled on CrowdFlower (see our SIGIR 2016 paper for details).

This dataset can be used for text classification research and related problems.

The categories of the Web pages are:

  • Informational: Web pages whose main purpose is to provide information (e.g., Wikipedia). Information can vary from scientific articles to event schedules.
  • Discussion & Opinion: Web pages with discussions, opinions, interviews, etc., often on social platforms.
  • News and Media: Web pages that provide news and articles on a range of topics from around the world.
  • Online Services: Web pages for online applications, or platforms for payment and shopping. These Web pages may list services or products to buy or use, user guides, etc.
  • Organizational: Institutional Web pages describing owners’ interests, activities, news or services, etc.
  • Entertainment: Web pages whose main purpose is to entertain users (e.g., games, movies).
  • Other: Web pages not fitting any of the above types.
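
As a rough illustration of how a labelled sample like this one can be used to estimate category proportions, here is a minimal Python sketch. It assumes the labels have already been read into a list of strings that match the category names above. The file format of the download is not described here, so loading is left out, and the helper name category_proportions and the toy labels are hypothetical.

```python
from collections import Counter

# Category names as listed above. Whether the released files store
# exactly these strings is an assumption.
CATEGORIES = [
    "Informational",
    "Discussion & Opinion",
    "News and Media",
    "Online Services",
    "Organizational",
    "Entertainment",
    "Other",
]


def category_proportions(labels):
    """Return the fraction of pages in each category for a list of labels."""
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    # Categories that never occur get a proportion of 0.0.
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# Toy labels for illustration only; these are not taken from the dataset.
toy_labels = ["Informational", "News and Media", "Informational", "Other"]
proportions = category_proportions(toy_labels)
```

The same list of labels, paired with the page text, could serve as the target column for a text classifier.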