Dataset

This original corpus consists of OCRed documents in 10 European languages, totaling about 20M characters (3.5M tokens), aligned with their corresponding Gold Standard (Ground Truth). Each language contains one or more sub-folders (unbalanced), organized by the source of the collected data, as follows:

[Figure: POCR_dataset_info]

Each training file contains three blocks, according to the following structure. Note that only the first block, [OCR_output], will be included in the test set.
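As a rough illustration of how such a file might be read, here is a minimal Python sketch. It assumes that each block sits on a single line beginning with a bracketed label (as in [OCR_output]); that layout is an assumption, not something confirmed by the description above. The function name parse_blocks and the labels block_two and block_three are hypothetical placeholders, and the sample text is invented for the example.

```python
import re

# Assumption: each block is one line of the form "[LABEL] content".
# Only the label "OCR_output" comes from the dataset description;
# the other labels used in the sample below are hypothetical placeholders.
BLOCK_RE = re.compile(r"^\[\s*([^\]]+?)\s*\]\s?(.*)$")


def parse_blocks(text):
    """Return a dict mapping each block label to the text that follows it."""
    blocks = {}
    for line in text.splitlines():
        match = BLOCK_RE.match(line)
        if match:
            label, content = match.groups()
            blocks[label] = content
    return blocks


# Invented sample with three blocks, as in a training file.
sample = (
    "[OCR_output] Tlie quick brown fox\n"
    "[block_two] Tlie quick brown fox\n"
    "[block_three] The quick brown fox\n"
)

blocks = parse_blocks(sample)
ocr_text = blocks["OCR_output"]
```

Under the same assumption, a test-set file would yield a dictionary with the single key OCR_output, so code written against parse_blocks should not rely on the other two blocks being present.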

References

  • IMPACT, European Commission’s 7th Framework Program, grant agreement 215064
  • Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
  • https://digi.nationallibrary.fi, Wiipuri, 31.12.1904, Digital Collections of National Library of Finland
  • EU Horizon 2020 research and innovation programme grant agreement No 770299

Origins and copyrights related to every text are detailed in the full version of the dataset.