This competition invites researchers from any field that can be applied to document analysis (e.g. natural language processing, data analysis, text data mining, etc.) to challenge their method(s) for improving and denoising OCR-ed texts on a testbed of more than 20 million characters. Given the noisy OCR of printed text from different sources and languages (English, French, German, Finnish, Spanish, Dutch, Czech, Bulgarian, Slovak and Polish), participants are invited to take part in two tasks, which can be performed independently. The two tasks rely on different parts of the data set in order to avoid any bias in their respective evaluations. It should additionally be noted that we will compute separate scores for each language of the collection, allowing for the evaluation of language-specific approaches.
Given the raw OCR-ed text, the participants are asked to provide the position and the length of each suspected error. The length information is non-trivial: although it can often be recovered from word boundaries, it varies in some cases (e.g. wrongly OCR-ed separators such as spaces, hyphens or line breaks).
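To make the expected output concrete, the following is a minimal detection sketch assuming a simple vocabulary-lookup baseline; the detect_errors helper, the token-level heuristic and the (position, length) tuple output are illustrative assumptions, not the official submission format.

```python
import re

def detect_errors(ocr_text, vocabulary):
    """Flag whitespace-delimited tokens absent from the vocabulary.

    Returns (position, length) pairs, where position is the character
    offset of the suspected error in the raw OCR-ed text.
    """
    spans = []
    for match in re.finditer(r"\S+", ocr_text):
        # Strip common punctuation for the lookup, but report the
        # position and length of the full token as it appears in the text.
        token = match.group().strip(".,;:!?").lower()
        if token and token not in vocabulary:
            spans.append((match.start(), len(match.group())))
    return spans

vocab = {"the", "quick", "brown", "fox", "example"}
# Both "tbe" and the broken hyphenation "exam- ple" are flagged:
print(detect_errors("tbe quick brown fox exam- ple", vocab))
# [(0, 3), (20, 5), (26, 3)]
```

Note that such a baseline flags the hyphenated fragments as two separate errors; recovering the intended single-word span is precisely where the length information becomes non-trivial.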
Given the OCR errors in their context, the participants are asked to provide, for each error, either a) a single correction or b) a ranked list of correction candidates. Providing multiple candidates enables the evaluation of semi-automated techniques, as sketched below.
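As an illustration of the ranked-candidate setting, the following sketch ranks dictionary words by string similarity to an OCR error using the standard-library difflib; the rank_candidates helper, the vocabulary and the top_k cut-off are illustrative assumptions rather than part of the competition protocol.

```python
from difflib import SequenceMatcher

def rank_candidates(error_token, vocabulary, top_k=5):
    """Rank vocabulary words by string similarity to the OCR error.

    A fully automated system keeps only the first candidate; a
    semi-automated one can present the whole ranked list to a reviewer.
    """
    scored = [
        (SequenceMatcher(None, error_token, word).ratio(), word)
        for word in vocabulary
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]

vocab = ["the", "them", "then", "quick", "brown"]
print(rank_candidates("tbe", vocab))  # 'the' is ranked first
```

We will thus take into account and evaluate two families of systems: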