Dataset

This original corpus consists of OCRed documents in 10 European languages, totalling about 20M characters (3.5M tokens), aligned with their corresponding Gold Standard (Ground Truth). Each language contains one or several sub-folders (unbalanced), organized by the source of the collected data, as follows:

[Figure: POCR_dataset_info]

Each training file contains three blocks, structured as follows. Note that only the first block, [OCR_output], will be included in the test set.
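As a rough illustration of how such a file might be read, the sketch below splits a text into labelled blocks. It assumes that each block sits on its own line and starts with a bracketed label such as [OCR_output]; this layout, the function name parse_blocks, and the labels other than [OCR_output] in the sample are assumptions made for the example, not part of the dataset description.

```python
import re

# Assumption: every block occupies one line and begins with a bracketed label,
# e.g. "[OCR_output] some text". Only "[OCR_output]" is named on this page;
# any other label is treated generically.
BLOCK_RE = re.compile(r"^\[\s*([^\]]+?)\s*\]\s?(.*)$")


def parse_blocks(text):
    """Return {label: content} for every line that starts with a bracketed label."""
    blocks = {}
    for line in text.splitlines():
        match = BLOCK_RE.match(line)
        if match:
            label, content = match.groups()
            blocks[label] = content
    return blocks


# Hypothetical sample: the second and third labels are placeholders.
sample = (
    "[OCR_output] Tbe quick brown f0x\n"
    "[block_two] placeholder\n"
    "[block_three] placeholder"
)
blocks = parse_blocks(sample)
print(blocks["OCR_output"])
```

A test-set file, which would carry only the first block, goes through the same function and simply yields a dictionary with a single entry.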

Download links

ZIP: ICDAR2019_POCR_competition_dataset.zip

Zenodo: https://zenodo.org/record/3515403

TC-11: http://tc11.cvc.uab.es/datasets/Post-OCR_2019_1

References

IMPACT, European Commission’s 7th Framework Program, grant agreement 215064
Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
https://digi.nationallibrary.fi, Wiipuri, 31.12.1904, Digital Collections of the National Library of Finland
EU Horizon 2020 research and innovation programme grant agreement No 770299

Origins and copyrights related to every text are detailed in the full version of the dataset.
