Getting Started

This page shows how to read (or process) WARC files and iterate over all the Web pages they contain.

1. Download sample ArabicWeb16 WARC files

You can download sample files (786 MB) of around 62K Web pages from here.

2. Read the sample files

To read the files, you need to download two Java classes: WarcHTMLResponseRecord.java and WarcRecord.java. These classes are slightly modified versions of the ClueWeb09 dataset reader.

We also provide a test reader class, TestWarcReader.java, that iterates over the WARC files in a given directory and reads the WARC records (Web pages) in each of them.
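
The directory iteration described above can be sketched with the standard library alone. This is only an illustration, not the code of TestWarcReader.java: the class name WarcDirectoryLister and the method listFiles are hypothetical, and the sketch lists files without parsing any WARC records.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the provided reader classes.
class WarcDirectoryLister {

    // Collect the regular files directly inside dir, sorted by path.
    // Sub-directories are skipped; no file extension is assumed.
    static List<Path> listFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Default to a directory named "sample", as in the steps on this page.
        Path dir = Paths.get(args.length > 0 ? args[0] : "sample");
        if (!Files.isDirectory(dir)) {
            System.err.println("Not a directory: " + dir);
            return;
        }
        for (Path file : listFiles(dir)) {
            System.out.println(file.getFileName() + " (" + Files.size(file) + " bytes)");
        }
    }
}
```

Sorting the paths keeps the processing order stable between runs, which makes it easier to compare the printed document IDs across tests.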

You can download all three Java files as a zip file from here.

To test reading the sample files:

Create a directory called "reader".
Inside it, create a sub-directory called "src" and put the three Java classes above there (two reader classes and one test class).
Create a sub-directory called "sample" and add the sample WARC files to it.
Create a sub-directory called "classes".
From the "reader" directory, compile the code and run the test class as follows (the compile command uses the Windows path separator; on Linux or macOS, write src/*.java instead):

javac -sourcepath src -d classes src\*.java

java -Dfile.encoding=UTF8 -cp classes TestWarcReader sample

If the test runs successfully, you will see a list of document IDs followed (towards the end) by this line:

Successful test. Number of read pages = 61934

Check the test code to see how to get several pieces of information from each WARC record, such as the corresponding document ID, target URL, and raw content.
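
To give a rough idea of where such fields come from, the sketch below parses the "Name: value" header lines of a single WARC-style record. It does not use the provided reader classes. The class WarcHeaderSketch, the sample record, and the field names WARC-Target-URI and WARC-TREC-ID (borrowed from ClueWeb09-style records) are all assumptions for illustration; check the actual files and the test code for the names used in this dataset.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, independent of WarcRecord.java.
class WarcHeaderSketch {

    // Parse "Name: value" lines until the first blank line.
    // Lines without a colon (such as a version line) are skipped.
    static Map<String, String> parseHeader(String headerBlock) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : headerBlock.split("\r?\n")) {
            if (line.trim().isEmpty()) {
                break; // a blank line ends the header block
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample record; field names and values are assumptions.
        String record = "WARC/0.18\n"
                + "WARC-Type: response\n"
                + "WARC-Target-URI: http://example.com/page\n"
                + "WARC-TREC-ID: example-doc-0001\n"
                + "\n"
                + "<html>raw content would follow here</html>";
        Map<String, String> header = parseHeader(record);
        System.out.println("URL: " + header.get("WARC-Target-URI"));
        System.out.println("ID:  " + header.get("WARC-TREC-ID"));
    }
}
```

Only the first colon on each line is treated as the separator, so URL values that contain further colons stay intact.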

Notes:

The Arabic content is encoded in UTF-8 (where proper UTF-8 character encodings apply), so you will have to read the content accordingly.
We made a considerable effort to avoid any content that might be inappropriate; however, we offer no warranty about the content of the dataset.
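
For the encoding note above, one way to handle the content is to decode the raw bytes explicitly as UTF-8 instead of relying on the platform default. The following is a minimal sketch under that assumption; the class Utf8ContentDecoder and its methods are hypothetical and are not part of the provided reader classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for decoding raw page bytes.
class Utf8ContentDecoder {

    // Decode as UTF-8; malformed byte sequences become U+FFFD.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Return true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Five Arabic letters written as Unicode escapes, so the source
        // file compiles the same way under any source encoding.
        byte[] good = "\u0645\u0631\u062D\u0628\u0627".getBytes(StandardCharsets.UTF_8);
        // A valid two-byte sequence followed by a byte that is never valid in UTF-8.
        byte[] bad = {(byte) 0xD9, (byte) 0x85, (byte) 0xFF};

        System.out.println("good is valid UTF-8: " + isValidUtf8(good));
        System.out.println("bad is valid UTF-8:  " + isValidUtf8(bad));
        System.out.println("lenient length of bad: " + decodeLenient(bad).length());
    }
}
```

The strict check can be used to count or skip pages whose bytes are not proper UTF-8, while the lenient decoder keeps such pages readable with replacement characters.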