Frequently asked questions

LAST UPDATE: 27/06/2017, 5pm (UTC)

Could you elaborate a bit on the Task 2 metric?

o The metrics are implemented in "evalTool_ICDAR2017.py" (https://git.univ-lr.fr/gchiro01/icdar2017/tree/master) and can be tried out on the training set. Regarding Task 2: for every token, a weighted sum of the Levenshtein distances (LD) between the correction candidates and the corresponding token in the Ground Truth is computed. The goal is therefore to minimize that distance over all tokens, as in the example and sketch below.

         Candidate 1                       Candidate 2
 LD('I NEVER', 'I NEVER') * 0.9  +  LD('I EVER', 'I NEVER') * 0.1  +  ...

/!\ If the sum of the provided candidate weights does not equal 1, it is automatically normalized.
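
For illustration only, here is a minimal Python sketch of that token-level score under the assumptions above (plain Levenshtein distance, weights normalized to sum to 1). The helper names below are ours, not those of the official evalTool_ICDAR2017.py:

    # Illustrative sketch of the Task 2 scoring idea (not the official
    # evalTool_ICDAR2017.py code): for one token, the score is the sum of
    # Levenshtein distances to the Ground Truth, weighted by the (normalized)
    # candidate weights.

    def levenshtein(a, b):
        """Plain dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def token_score(candidates, ground_truth):
        """candidates: dict {correction: weight}; lower score is better."""
        total = sum(candidates.values())
        return sum(levenshtein(cand, ground_truth) * (w / total)  # weights normalized to 1
                   for cand, w in candidates.items())

    # The example above: LD('I NEVER','I NEVER')*0.9 + LD('I EVER','I NEVER')*0.1
    print(token_score({'I NEVER': 0.9, 'I EVER': 0.1}, 'I NEVER'))  # prints 0.1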


Some error positions (and/or lengths) given in "erroneous_tokens_pos.json" for Task 2 do not seem correct.

o Those cases are often related to hyphenation (e.g. eng_periodical/3.txt@11254:1 "misfor tune", eng_periodical/3.txt@11922:1 "communi cation"). Given the complexity of dealing with hyphen corrections, it was decided to ignore hyphen-related tokens during the evaluation. Whether or not you correct these errors therefore has no impact on the final result.

Questions asked during the 2017 edition


Can we send more than one submission per team?

o Multiple submissions are allowed if your team has several off-the-shelf systems. This involves creating an independent ZIP archive for each submission (results accompanied by both a short and a long summary description). However, submitting the same method several times with different parameter sets is not allowed.


Why 2 distinct and independent tasks “Task 1) Detection” and “Task 2) Correction” rather than one full task?

o It gives institutions that do not have a full correction system the opportunity to participate in a subtask.

o It allows participants to bypass "Task 1) Detection", which, given the noisiness of the dataset, could lead to low scores in the training phase.

o It limits the challenge to a reasonable number of metrics: 2 metrics, and thus 2 rankings.


Why not ask for the fully corrected text for each file?

o Comparing a fully corrected text with the corresponding Ground Truth would require applying an alignment process to every participant's results, which could introduce inconsistencies in case of misalignments.
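
As an illustration of that alignment step (not something the evaluation actually performs on submissions), the hypothetical Python snippet below aligns a freely corrected line against the Ground Truth with difflib and prints the edit regions; every insertion or deletion shifts the offsets of everything that follows, which is where misalignments would creep in:

    # Illustration only: mapping a fully corrected text back onto the Ground
    # Truth requires a sequence alignment, and each insertion/deletion shifts
    # the character offsets of the rest of the line.
    import difflib

    gt        = "the misfortune of the nation"     # hypothetical Ground Truth
    corrected = "the misfor tune of teh nation"    # hypothetical participant output

    sm = difflib.SequenceMatcher(a=corrected, b=gt, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            print(tag, repr(corrected[i1:i2]), "->", repr(gt[j1:j2]))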

The correction of spaces, hyphens or inconsistent sequences seems like flipping a coin in terms of results.

o That’s why correction candidates (good or bad) proposed for hyphens or inconsistent sequences (often not well aligned with the ground truth) are not taken into account by the evaluation script.


Is it allowed to train our model(s) on external sources (data, dictionaries, lexicons)?

o Yes, feel free to train your model(s) on external resources. Please don’t forget to give details in the summary description of your approach.

Is it allowed to use different models/training datasets for each different folder?

o Yes, but please do not forget to mention it in the summary description of your approach.

Regarding historical variants of words, will corrections proposed with modernized spellings be penalized?

o Yes, the evaluation of the Correction Candidates (CCs) relies blindly on the Levenshtein distance between the CCs and the corresponding sequence in the ground truth. For example, proposing the modernized "said" where the ground truth reads the historical "sayd" costs a Levenshtein distance of 1, whereas keeping "sayd" costs 0.

What is the encoding of the files and what range of characters should we expect in them?

o The files are encoded in UTF-8. The range of characters depends on the OCR engine itself (over which we have no control). However, we have normalized common/similar characters such as [ — – - ] => - , [ “ ” " ] => " , [ ‘ ’ ] => '
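
As a minimal sketch (covering only the substitutions listed above; this is not the organizers' exact preprocessing code), the same normalization can be reproduced in Python with str.translate:

    # Reproduces only the substitutions listed in this FAQ.
    NORMALIZATION = str.maketrans({
        "\u2014": "-",   # em dash
        "\u2013": "-",   # en dash
        "\u201c": '"',   # left double quotation mark
        "\u201d": '"',   # right double quotation mark
        "\u2018": "'",   # left single quotation mark
        "\u2019": "'",   # right single quotation mark
    })

    def normalize(text):
        return text.translate(NORMALIZATION)

    print(normalize("\u201cI never\u2014said so\u201d"))  # -> "I never-said so"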

How do we compute the offset alignment in the evaluation phase, since we only have the [OCR_toInput] form?

o Actually, the offsets expected in the results are relative to [OCR_toInput]. There is no need to pay attention to the [OCR/GS_aligned]-related offsets, as the evaluation script converts them automatically.
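
As a tiny hypothetical example (the text and token below are made up), the positions to report are plain character offsets in the raw [OCR_toInput] text:

    # Hypothetical [OCR_toInput] content; offsets are plain character positions in it.
    ocr_to_input = "I EVER said that the misfor tune was ours"

    token = "EVER"
    offset = ocr_to_input.find(token)   # character offset in [OCR_toInput]: 2
    length = len(token)                 # 4
    print(offset, length)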