Dataset

The competition will be carried out on a filtered version of the KERTAS dataset [3,4]. The dataset includes various historical Arabic manuscripts that span 14 Islamic centuries and cover multiple topics: science, literature, poetry, metaphysics, and religion. It contains images with hand-drawn figures and tables in different colors. Some samples also contain additional comments written in the margins by different authors and in different styles. This makes the dataset challenging enough to evaluate participating systems and to identify their potential limitations. The following are samples of those specific images.

The filtering of the original KERTAS dataset [3,4] involved removing duplicates and images without text (such as the front and back covers of the manuscript books from which the images were taken). The resulting dataset contains 1688 distinct images in JPG format, of arbitrary size. The dataset is split into three separate sets: the first set is intended for training and contains 1063 images; the second set comprises 170 distinct images intended for validation; and the last set comprises 455 carefully selected images that are included in neither the training nor the validation set. The latter will be used to evaluate the performance of submitted systems. All three sets contain samples from all 14 Islamic centuries. The following table shows the distribution of samples over the 14 Islamic centuries for the training and validation sets.

Images with specific test formats are distributed over the three sets (training, validation, and test). Specifically, images with hand-drawn figures represent 5% of the entire dataset, while images with tables and with text in the margins represent 2% and 12% of the dataset, respectively.
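
As a quick sanity check on the figures quoted above, the split sizes and feature percentages can be turned into fractions and approximate image counts. The following is only a minimal sketch: the names SPLITS, TOTAL, FEATURE_SHARE, split_fractions and approximate_counts are illustrative and not part of the dataset or any competition tooling, and the feature counts are estimates because the quoted percentages appear to be rounded.

```python
# Figures copied from the dataset description; all names here are illustrative.
SPLITS = {"training": 1063, "validation": 170, "test": 455}
TOTAL = 1688
# Shares of the entire dataset, as quoted (presumably rounded percentages).
FEATURE_SHARE = {"hand-drawn figures": 0.05, "tables": 0.02, "margin text": 0.12}


def split_fractions(splits):
    """Return each split's share of the total number of images."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


def approximate_counts(total, shares):
    """Estimate image counts per feature from the quoted percentages."""
    return {name: round(total * share) for name, share in shares.items()}


if __name__ == "__main__":
    # The three splits should account for every image in the filtered dataset.
    assert sum(SPLITS.values()) == TOTAL
    for name, frac in split_fractions(SPLITS).items():
        print(f"{name}: {SPLITS[name]} images ({frac:.1%})")
    for name, n in approximate_counts(TOTAL, FEATURE_SHARE).items():
        print(f"{name}: about {n} images")
```

Under these figures the training, validation, and test sets account for roughly 63%, 10%, and 27% of the images, and the quoted percentages correspond to about 84 images with hand-drawn figures, 34 with tables, and 203 with text in the margins.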

Dataset links: