Evaluation

The evaluation will be performed by comparing the participants' system outputs to the Ground Truth. Each task has its own evaluation measures, as detailed hereafter. In summary, we will provide evaluation measures:

  • for OCR error detection (Task 1), and for fully-automated and semi-automated OCR correction (Task 2);
  • for English and French documents separately, and for the full data set at once (to evidence the performance of language-agnostic techniques).


Input & Output formats

Download the example below: HERE

Metrics

The script used to evaluate the participants' results is provided: https://git.univ-lr.fr/gchiro01/icdar2017/tree/master

Important notes:

  • [OCR_aligned] and [GS_aligned] are provided in the training set, but won't be given in the evaluation set.
  • Removing the alignment symbols "@" from [OCR_aligned] recovers exactly [OCR_toInput].
  • Tokens are simply space-separated sequences, with no restriction on punctuation. Examples of tokens: "i", "i'am", "football?", "qm87-7lk_.,qs'&"
  • Tokens which are aligned with "#" symbol(s) in the Gold Standard will be ignored in the metrics.
  • Given the complexity of dealing with hyphen corrections, it has been decided to ignore hyphen-related tokens during the evaluation, so whether or not you correct these errors does not impact the final result.
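The notes on the "@" alignment symbols and on tokenization can be illustrated with a minimal Python sketch (ours, not the official evaluation script; the example strings are made up):

```python
# Stripping the "@" padding from [OCR_aligned] recovers [OCR_toInput]
# exactly, per the notes above.
ocr_aligned = "This is a sam@ple sen@@tence"
ocr_to_input = ocr_aligned.replace("@", "")

# Tokens are plain whitespace-separated sequences; punctuation is kept
# attached to the token, with no special handling.
tokens = ocr_to_input.split()
```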


Task 1) Error detection:

The detection task will be evaluated based on recall, precision, and F-measure, as it is purely a matter of deciding whether each token is erroneous or not. The ranking will be based on the F-measure.
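As a rough sketch of this scoring (a hypothetical helper, not the official script), detection can be seen as comparing the set of token positions a system flags as erroneous against the positions that are truly erroneous in the Gold Standard:

```python
def detection_scores(predicted, truth):
    """Return (precision, recall, f_measure) for two sets of token positions.

    predicted: positions the system flagged as erroneous.
    truth: positions that are actually erroneous in the Gold Standard.
    """
    tp = len(predicted & truth)                         # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

For instance, flagging positions {0, 3, 7} when the truly erroneous positions are {3, 7, 9} yields two true positives, hence precision, recall, and F-measure of 2/3 each.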

Task 2) Error correction:

As mentioned earlier, the correction task involves a list of candidate words for each error and will be evaluated on two different scenarios:

  • a "fully automated" scenario, taking into consideration only the highest-weighted word in each list;
  • a "semi-automated" scenario, exploiting all the proposed corrections along with their weights.

The chosen metric considers, for every token, a weighted sum of the Levenshtein distances between the correction candidates and the corresponding token in the Ground Truth. The goal is therefore to minimize that distance summed over all tokens.
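A minimal sketch of this metric, under our own assumptions about its exact form (candidates given as (correction, weight) pairs, and the weighted sum normalized by the total weight; the official script may normalize differently):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def token_score(candidates, gs_token):
    """Weight-averaged Levenshtein distance of the candidates to the GS token.

    candidates: list of (correction, weight) pairs for one erroneous token.
    In the "fully automated" scenario, only the highest-weighted pair
    would be kept before calling this.
    """
    total = sum(w for _, w in candidates)
    return sum(w * levenshtein(c, gs_token) for c, w in candidates) / total
```

A perfect top candidate with all the weight gives a score of 0 for that token; the overall objective is the sum of these scores over all tokens.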