Challenge

This competition invites researchers from any field that can be applied to document analysis (e.g. natural language processing, data analysis, text data mining...) to challenge their method(s) for improving/denoising OCR-ed texts, on a testbed of more than 20 million characters. Given the noisy OCR of printed text from different sources and languages (English, French, German, Finnish, Spanish, Dutch, Czech, Bulgarian, Slovak and Polish), participants are invited to take part in two tasks, which can be performed independently. The two tasks rely on different parts of the dataset, to avoid any bias in their respective evaluation. It should additionally be noted that we will be able to compute separate scores for each language of the collection, allowing for the evaluation of language-specific approaches.

Task 1 - Detection of OCR errors

Given the raw OCR-ed text, the participants are asked to provide the position and length of the suspected errors. The length information is non-trivial: although it can often be recovered from word boundaries, it may vary in some cases (e.g. wrongly OCR-ed separators such as spaces, hyphens or line breaks).
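
As a purely illustrative sketch (not the official submission format), a simple detector might emit (position, length) pairs by flagging tokens that are absent from a reference lexicon; the lexicon, tokenisation and output structure below are assumptions made for the example only.

```python
# Illustrative sketch only: a naive detector that flags out-of-lexicon tokens.
# The lexicon, tokenisation, and output format are assumptions, not the
# competition's official specification.
import re

LEXICON = {"the", "quick", "brown", "fox"}  # stand-in for a real dictionary

def detect_errors(ocr_text: str):
    """Return (position, length) pairs for suspected OCR errors."""
    suspects = []
    for match in re.finditer(r"\S+", ocr_text):
        token = match.group().strip(".,;:!?").lower()
        if token and token not in LEXICON:
            suspects.append((match.start(), len(match.group())))
    return suspects

print(detect_errors("The qu1ck brown f0x"))  # -> [(4, 5), (16, 3)]
```

Note that a real system would also have to handle separator errors (merged or split words), which is precisely why the length of each suspected error must be reported explicitly.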

Task 2 - Correction of OCR errors

Given the OCR errors in their context, the participants are asked to provide, for each error, either (a) a single correction or (b) a ranked list of correction candidates. Providing multiple candidates enables the evaluation of semi-automated techniques. We will thus evaluate two families of systems:

  1. "Fully-automated" systems, meant for the comparative evaluation of fully automatic OCR correction tools, where we only take into account one correction candidate;
  2. "Semi-automated" systems, meant for the comparative evaluation of human-assisted correction tools, where a person typically picks the right correction within a list of system-generated candidate corrections (in that case, the higher-ranked the right correction, the better the system).