Evaluation

The evaluation will be performed by comparing the participants' system outputs to the Ground Truth. Each task has its own evaluation measures, as detailed hereafter. In summary, we will provide evaluations and measurements for:

  • OCR error detection (Task 1): position and length of the suspected errors;

  • OCR error correction (Task 2): fully automated (one correction candidate) and semi-automated (ordered list of correction candidates);

  • Each of the 10 language document sets (BG, CZ, DE, EN, ES, FI, FR, NL, PL, SL) separately, and the full data set at once (to evidence the language dependency);

  • Each of the 18 sub-language document sets separately (e.g. BG1, CZ1, DE1, DE2, DE3, etc.) to evidence the writing-style and dataset dependency.


Input & Output formats

Download an example for English HERE. This .zip archive contains a folder with two example input training text files (provided by the organizers) and an example JSON output file (as submitted by participants).
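As a rough illustration of how such a submission could be assembled, a small sketch follows. The file layout and JSON schema used here (in particular the "position:length" keys, the weighted candidate lists, and the file path) are assumptions for illustration only; the downloadable archive remains the authoritative reference.

    import json

    def read_training_file(path):
        """Return the prefixed lines ([OCR_toInput], [OCR_aligned], [ GS_aligned])
        of a training file as a dict (illustrative parsing, not the official reader)."""
        content = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                prefix, _, text = line.partition("]")
                content[prefix.lstrip("[").strip()] = text.lstrip().rstrip("\n")
        return content

    # Hypothetical submission: one entry per input file, each suspected error
    # keyed by "position:length" and mapped to weighted correction candidates.
    submission = {
        "EN/EN1/0001.txt": {                      # hypothetical file path
            "10:7": {"example": 0.7, "exemple": 0.3},
        }
    }
    with open("results.json", "w", encoding="utf-8") as out:
        json.dump(submission, out, ensure_ascii=False, indent=2)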

Metrics

The script used to evaluate the participants' results is provided: https://gitlab.univ-lr.fr/crigau02/icdar2019-post-ocr-text-correction-competition

Task 1) Error detection:

The detection task will be evaluated using precision, recall and F-measure, as it is purely a matter of whether tokens are truly erroneous or not. The ranking will be based on the F-measure.
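To make the Task 1 metric concrete, here is a minimal sketch of how token-level precision, recall and F-measure could be computed from sets of detected and Ground Truth error positions. The variable names and the (position, length) representation are illustrative; the reference implementation is the evaluation script linked above.

    def detection_scores(predicted_errors, true_errors):
        """Token-level precision, recall and F-measure.

        Both arguments are sets of (position, length) pairs identifying the
        tokens flagged as erroneous (illustrative representation).
        """
        true_positives = len(predicted_errors & true_errors)
        precision = true_positives / len(predicted_errors) if predicted_errors else 0.0
        recall = true_positives / len(true_errors) if true_errors else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    # Example: 2 of the 3 detected tokens are truly erroneous, 1 real error is missed.
    print(detection_scores({(0, 4), (10, 3), (25, 6)},
                           {(0, 4), (10, 3), (40, 5)}))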

Task 2) Error correction:

As mentioned earlier, the correction task involves a list of candidate words for each error and will be evaluated in two different scenarios:

  • "fully automated" scenario, taking into consideration only the highest-weighted word in each list;

  • "semi-automated" scenario, exploited all the proposed corrections along with their weight.

The chosen metric considers, for every token, a weighted sum of the Levenshtein distances between the correction candidates and the corresponding token in the Ground Truth. The goal is therefore to minimize that distance over all the tokens.
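As an informal sketch of this metric (not the official scorer), the cost of one correction could be computed as below. The data layout, the assumption that weights sum to 1, and the handling of the fully-automated case are illustrative choices.

    def levenshtein(a, b):
        """Plain Levenshtein edit distance (dynamic programming)."""
        previous = list(range(len(b) + 1))
        for i, char_a in enumerate(a, 1):
            current = [i]
            for j, char_b in enumerate(b, 1):
                current.append(min(previous[j] + 1,                        # deletion
                                   current[j - 1] + 1,                     # insertion
                                   previous[j - 1] + (char_a != char_b)))  # substitution
            previous = current
        return previous[-1]

    def correction_cost(candidates, gold_token, fully_automated=False):
        """Weighted sum of Levenshtein distances to the Ground Truth token.

        `candidates` maps each proposed correction to its weight; weights are
        assumed to sum to 1 (illustrative assumption).
        """
        if fully_automated:
            # Keep only the highest-weighted candidate.
            best = max(candidates, key=candidates.get)
            candidates = {best: 1.0}
        return sum(weight * levenshtein(cand, gold_token)
                   for cand, weight in candidates.items())

    # Example: two weighted candidates for an OCR token whose Ground Truth is "example".
    print(correction_cost({"example": 0.7, "exemple": 0.3}, "example"))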

Important notes:

  • [OCR_aligned] and [ GS_aligned] are provided in the training set, but won't be given in the evaluation set.

  • Removing the alignment symbols "@" from the [OCR_aligned] recovers exactly the [OCR_toInput] (see the sketch after this list).

  • Tokens are simply space-separated sequences, with no restriction on punctuation. Examples of tokens: "i", "i'am", "football?", "qm87-7lk_.,qs'&"

  • Tokens which are aligned with "#" symbol(s) in the Gold Standard will be ignored in the metrics.

  • Given the complexity of dealing with hyphen corrections, it has been decided to ignore hyphen-related tokens during the evaluation, so whether or not you correct these errors does not impact the final result.
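To make the notes on alignment symbols and tokens concrete, here is a tiny sketch (identifiers are illustrative):

    # Dropping the "@" alignment symbols recovers exactly the [OCR_toInput] text.
    ocr_aligned = "This is a@n ex@ample"
    ocr_to_input = "This is an example"
    assert ocr_aligned.replace("@", "") == ocr_to_input

    # Tokens are plain space-separated sequences, punctuation included.
    print(ocr_to_input.split())   # ['This', 'is', 'an', 'example']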