Dataset
This original corpus consists of OCRed documents in 10 European languages, totalling about 20M characters (3.5M tokens), aligned with their corresponding Gold Standard (ground truth). Each language contains one or more sub-folders (unbalanced), organized according to the sources from which the data were collected.
Each training file contains three blocks. Note that only the first block, [OCR_output], will be included in the test set.
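As a rough illustration of reading such a file, the sketch below splits a training file into its blocks. It assumes that each block starts on its own line with a bracketed tag such as [OCR_output] followed by the text; that layout, the function name `parse_blocks`, and the placeholder tags in the usage example are assumptions, since only [OCR_output] is named here.

```python
import re

# Assumed layout: every block begins on a new line with a bracketed tag,
# for example "[OCR_output] some text". Only [OCR_output] is named in the
# description above; other tags are picked up generically.
BLOCK_RE = re.compile(r"^\[\s*([^\]]+?)\s*\]\s?(.*)$")


def parse_blocks(text):
    """Return a dict mapping each bracketed tag to the text that follows it."""
    blocks = {}
    current = None
    for line in text.splitlines():
        match = BLOCK_RE.match(line)
        if match:
            # A new tagged line opens a new block.
            current = match.group(1)
            blocks[current] = match.group(2)
        elif current is not None:
            # Untagged lines are treated as a continuation of the last block.
            blocks[current] += "\n" + line
    return blocks


# Hypothetical sample; "tag_b" and "tag_c" are placeholders, not real tag names.
sample = "[OCR_output] helo wrld\n[tag_b] second block\n[tag_c] third block"
blocks = parse_blocks(sample)
ocr_text = blocks.get("OCR_output")
```

A test file would then yield a dictionary with the single key `OCR_output`, so code written against `blocks.get("OCR_output")` works for both training and test files under these assumptions.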
References
- IMPACT, European Commission’s 7th Framework Program, grant agreement 215064
- Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
- https://digi.nationallibrary.fi, Wiipuri, 31.12.1904, Digital Collections of National Library of Finland
- EU Horizon 2020 research and innovation programme grant agreement No 770299
Origins and copyrights related to every text are detailed in the full version of the dataset.