U-DIADS-TL (Uniud - Document Image Analysis DataSet - Text Line version) is a dataset specifically designed for text line segmentation in ancient manuscripts. U-DIADS-TL provides noise-free annotations with non-overlapping text elements and accommodates diverse document structures, including multi-column layouts.
Description
U-DIADS-TL is a proprietary dataset developed through a collaboration between computer scientists and humanities scholars at the University of Udine. It comprises three ancient manuscripts (Latin 2, Latin 14396, and Syriaque 341), with 28 unique color page images selected from each and divided into three subsets: 3 images for training, 10 for validation, and 15 for testing. The images of the three manuscripts were collected from the digital library Gallica. All manuscripts are Latin and Syriac Bibles dating from the 6th to the 12th centuries A.D.
Paris, Bibliothèque nationale de France, the Second Bible of Charles the Bald, Latin 2
Paris, Bibliothèque nationale de France, Genesis-Kings, the First Volume, Latin 14396
Paris, Bibliothèque nationale de France, the Old Testament in the Syriac Peshitta version, Syriaque 341
U-DIADS-TL consists of 28 unique color page images for each manuscript, saved in JPEG format at a resolution of 1344×2016 pixels. Each page is paired with its Ground Truth (GT), stored as a PNG image of the same size as the original. The GTs contain two distinct, non-overlapping annotated classes: background and text lines.
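As a quick sanity check of the format described above, the following sketch reads the pixel dimensions from a PNG header using only the Python standard library and compares them with the stated size. It assumes that 1344×2016 means width×height; the names `png_size` and `has_expected_size` are illustrative and not part of the dataset.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
EXPECTED_SIZE = (1344, 2016)  # assumption: the stated size is width x height


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG byte string."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    if data[12:16] != b"IHDR":
        raise ValueError("IHDR chunk missing")
    # Width and height are two big-endian 32-bit integers right after "IHDR".
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def has_expected_size(data: bytes) -> bool:
    """Check a GT PNG against the size stated in the dataset description."""
    return png_size(data) == EXPECTED_SIZE
```

For a GT file on disk, the first 24 bytes are enough, for example `png_size(open(path, "rb").read(24))`, where `path` is a placeholder for wherever the GT images are stored.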
The manuscript Latin 2 comprises 2,741 lines of text: 317 lines were allocated to the training set, 955 to the validation set, and 1,469 to the test set.
The manuscript Latin 14396 contains 2,183 lines, with 232 lines for training, 785 for validation, and 1,166 for testing.
The manuscript Syriaque 341 is the most extensive, totaling 4,836 lines: 463 lines in the training set, 1,747 in the validation set, and 2,626 in the test set.
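The line counts above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the dictionary layout and the function names are illustrative, and the numbers are copied from the description.

```python
SPLITS = ("train", "validation", "test")

# Text lines per manuscript, in the order (train, validation, test).
LINE_COUNTS = {
    "Latin 2": (317, 955, 1469),
    "Latin 14396": (232, 785, 1166),
    "Syriaque 341": (463, 1747, 2626),
}


def totals_per_manuscript(counts):
    """Sum the three splits of each manuscript."""
    return {name: sum(per_split) for name, per_split in counts.items()}


def totals_per_split(counts):
    """Sum each split across all manuscripts."""
    return {
        split: sum(per_split[i] for per_split in counts.values())
        for i, split in enumerate(SPLITS)
    }


if __name__ == "__main__":
    print(totals_per_manuscript(LINE_COUNTS))
    print(totals_per_split(LINE_COUNTS))
```

The per-manuscript sums reproduce the stated totals (2,741, 2,183, and 4,836). Across the three manuscripts this gives 1,012 training lines, 3,487 validation lines, and 5,261 test lines, for 9,760 lines overall.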
Sample pages from the three manuscripts are shown below:
Latin 2
Latin 14396
Syriaque 341
The corresponding Ground Truths follow. Each color in the GT samples marks a different text line instance:
Latin 2
Latin 14396
Syriaque 341
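If a GT image uses one color per text line, as in the samples above, the number of annotated lines on a page can be estimated by counting the distinct non-background colors. The sketch below assumes a black background and pixels supplied as rows of (R, G, B) tuples. Neither assumption is confirmed by the description, and the function name is illustrative.

```python
from collections import Counter

BACKGROUND = (0, 0, 0)  # assumption: background pixels are black


def count_line_instances(pixels, background=BACKGROUND):
    """Count distinct non-background colors in a GT image.

    `pixels` is an iterable of rows, each a sequence of (R, G, B) tuples.
    Returns (number_of_instances, pixel_count_per_color).
    """
    colors = Counter(px for row in pixels for px in row)
    colors.pop(background, None)  # drop the background class if present
    return len(colors), colors
```

A small synthetic grid illustrates the idea:

```python
RED, GREEN, BLACK = (255, 0, 0), (0, 255, 0), (0, 0, 0)
grid = [
    [BLACK, RED, RED],
    [BLACK, BLACK, BLACK],
    [GREEN, GREEN, BLACK],
]
n_lines, sizes = count_line_instances(grid)  # two colors, so two instances
```

Anti-aliased or compressed visualizations may contain extra intermediate colors, so this count is only reliable on lossless label images.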
Citation
For any scientific publication using this data, the following paper should be cited:
TBA