Dataset Details

The dataset is about 2.0 TB (2,042,875 MB) gzip-compressed and about 10.9 TB uncompressed.

It contains about 150M (150,211,934) Arabic Web pages.
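As a rough illustration of what these figures imply, the sketch below derives the compression ratio and the average size per page. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the text does not state, so the results are estimates only.

```python
# Figures quoted in the dataset description.
COMPRESSED_MB = 2_042_875
UNCOMPRESSED_TB = 10.9
PAGES = 150_211_934

# Assumption (not stated in the description): decimal units.
compressed_bytes = COMPRESSED_MB * 10**6
uncompressed_bytes = UNCOMPRESSED_TB * 10**12

# How much smaller the gzip-compressed form is than the raw form.
compression_ratio = uncompressed_bytes / compressed_bytes

# Average size of one Web page, in kilobytes (1 KB = 1000 bytes).
avg_compressed_kb = compressed_bytes / PAGES / 1000
avg_uncompressed_kb = uncompressed_bytes / PAGES / 1000

print(f"compression ratio:     {compression_ratio:.1f}x")
print(f"avg compressed page:   {avg_compressed_kb:.1f} KB")
print(f"avg uncompressed page: {avg_uncompressed_kb:.1f} KB")
```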

Web pages in ArabicWeb16 are stored in files that conform to the WARC ISO 28500 version 0.18 standard ("WARC files"). The dataset contains 3,005 gzip-compressed WARC files. More information about the format can be found in the WARC format specification.

The files are named numerically from 0000 to 3004. Each WARC file contains exactly 50K WARC records (i.e., Web pages in various forms such as HTML, XML, and RSS feeds). The WARC files are partitioned into 31 directories named 00 to 30. Each directory holds 100 WARC files (exactly 5M Web pages), except the last one, which holds only 5 WARC files.
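The layout described above can be sketched as a small path helper. This is a minimal sketch under two assumptions that the description does not confirm: files are assigned to directories in numerical order (file number divided by 100), and each file carries a `.warc.gz` extension. The function name `warc_path` and its `root` and `ext` parameters are hypothetical.

```python
FILES_PER_DIR = 100   # each directory holds 100 WARC files (the last holds 5)
NUM_FILES = 3005      # files are numbered 0000 to 3004


def warc_path(file_number: int,
              root: str = "/ArabicWeb16",
              ext: str = ".warc.gz") -> str:
    """Return the assumed path of a WARC file given its number.

    Assumption: files are placed in directories in numerical order,
    so file N lives in directory N // 100. The extension is a guess.
    """
    if not 0 <= file_number < NUM_FILES:
        raise ValueError(f"file number must be in 0..{NUM_FILES - 1}")
    directory = file_number // FILES_PER_DIR
    return f"{root}/{directory:02d}/{file_number:04d}{ext}"


# Under this assumption, 30 full directories plus 5 files give 3005 files.
assert 30 * FILES_PER_DIR + 5 == NUM_FILES
print(warc_path(1234))
```

Under the same assumption, directory 30 would contain files 3000 to 3004.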

Each WARC record consists of a header and content. The WARC header has a custom field, WARC-DOC-ID, that holds a unique identifier for each Web page in ArabicWeb16.
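To illustrate how the WARC-DOC-ID field might be read, here is a minimal sketch that uses only the Python standard library. It assumes each record starts with a `WARC/` version line, lists its header fields one per line, ends the header with a blank line, and declares an accurate `Content-Length`. The function names `iter_warc_headers` and `doc_ids` are hypothetical, and a dedicated WARC library may be more robust in practice.

```python
import gzip


def iter_warc_headers(stream):
    """Yield one dict of header fields per record in an uncompressed WARC stream.

    Field names are upper-cased so lookups do not depend on the casing
    used in the file.
    """
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank separator between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break                   # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().upper()] = value.strip()
        # Skip the record content; assumes Content-Length is accurate.
        stream.read(int(headers.get("CONTENT-LENGTH", "0")))
        yield headers


def doc_ids(path):
    """Yield the WARC-DOC-ID of every record in a gzip-compressed WARC file."""
    with gzip.open(path, "rb") as stream:
        for headers in iter_warc_headers(stream):
            doc_id = headers.get("WARC-DOC-ID")
            if doc_id is not None:
                yield doc_id
```

A call such as `list(doc_ids(path_to_warc_file))` would then collect the identifiers of one file, where `path_to_warc_file` is a placeholder for a real file location.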

The sizes of the directories of compressed WARC files are as follows:

  • 58G /ArabicWeb16/00
  • 65G /ArabicWeb16/01
  • 65G /ArabicWeb16/02
  • 67G /ArabicWeb16/03
  • 66G /ArabicWeb16/04
  • 66G /ArabicWeb16/05
  • 68G /ArabicWeb16/06
  • 71G /ArabicWeb16/07
  • 70G /ArabicWeb16/08
  • 73G /ArabicWeb16/09
  • 73G /ArabicWeb16/10
  • 73G /ArabicWeb16/11
  • 73G /ArabicWeb16/12
  • 74G /ArabicWeb16/13
  • 73G /ArabicWeb16/14
  • 75G /ArabicWeb16/15
  • 76G /ArabicWeb16/16
  • 68G /ArabicWeb16/17
  • 66G /ArabicWeb16/18
  • 70G /ArabicWeb16/19
  • 59G /ArabicWeb16/20
  • 64G /ArabicWeb16/21
  • 60G /ArabicWeb16/22
  • 59G /ArabicWeb16/23
  • 61G /ArabicWeb16/24
  • 62G /ArabicWeb16/25
  • 63G /ArabicWeb16/26
  • 63G /ArabicWeb16/27
  • 63G /ArabicWeb16/28
  • 60G /ArabicWeb16/29
  • 2.9G /ArabicWeb16/30
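
As a sanity check after downloading, the directory sizes listed above can be parsed and summed. This is a sketch only: it assumes every size uses the `G` suffix as in the listing, and the helper name `parse_sizes` is hypothetical.

```python
LISTING = """\
58G /ArabicWeb16/00
65G /ArabicWeb16/01
65G /ArabicWeb16/02
67G /ArabicWeb16/03
66G /ArabicWeb16/04
66G /ArabicWeb16/05
68G /ArabicWeb16/06
71G /ArabicWeb16/07
70G /ArabicWeb16/08
73G /ArabicWeb16/09
73G /ArabicWeb16/10
73G /ArabicWeb16/11
73G /ArabicWeb16/12
74G /ArabicWeb16/13
73G /ArabicWeb16/14
75G /ArabicWeb16/15
76G /ArabicWeb16/16
68G /ArabicWeb16/17
66G /ArabicWeb16/18
70G /ArabicWeb16/19
59G /ArabicWeb16/20
64G /ArabicWeb16/21
60G /ArabicWeb16/22
59G /ArabicWeb16/23
61G /ArabicWeb16/24
62G /ArabicWeb16/25
63G /ArabicWeb16/26
63G /ArabicWeb16/27
63G /ArabicWeb16/28
60G /ArabicWeb16/29
2.9G /ArabicWeb16/30
"""


def parse_sizes(listing: str) -> dict:
    """Map each directory name (e.g. "07") to its size in G."""
    sizes = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        size, path = line.split()
        if not size.endswith("G"):
            raise ValueError(f"unexpected size unit in {line!r}")
        sizes[path.rsplit("/", 1)[-1]] = float(size[:-1])
    return sizes


sizes = parse_sizes(LISTING)
total_g = sum(sizes.values())
print(f"{len(sizes)} directories, {total_g:.1f}G in total")
```

The listed sizes add up to roughly 2,007G, which is in line with the stated total of about 2.0 TB; the small difference is expected from rounding in the per-directory figures.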