1st Competition on Script Identification in the Wild (SIW 2021)

in conjunction with

ICDAR 2021

With the ever-increasing demand for the creation of a digital world, many Optical Character Recognition (OCR) algorithms have been developed over the years. A script can be defined as the graphic form of the writing system used to write a statement. The availability of a large number of scripts makes the development of a universal OCR a challenging task, because the features needed for character recognition are usually a function of the structural properties of the script and of the number of possible classes or characters. The extremely high number of available scripts makes the task daunting, and as a result, most OCR systems are script-dependent. The usual approach for handling documents in a multi-script environment is divided into two steps: first, the script of the document, block, line or word is identified, and second, the appropriate OCR is applied. This approach requires a script identifier and a bank of OCRs, one OCR per possible script.

Many script identification algorithms have been proposed in the literature. Script identification can be conducted either offline, from scanned documents, or online, when the writing sequence is available. The task can also be classified as printed or handwritten, with the latter being the more challenging. Script identification can be performed at different levels: page or document, paragraph, block, line, word, and character. As in any classical classification problem, the difficulty of script identification is a function of the number of possible classes or scripts to be detected. Furthermore, any similarity in the structure of the scripts represents an added challenge.
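As a reading aid, the two-step approach can be summarized in the following minimal sketch; every name in it (the classifier stub, the OCR bank) is a hypothetical stand-in rather than part of any competition software:

    # Minimal sketch of the two-step multi-script OCR pipeline described above.
    # All names are hypothetical stand-ins, not part of the competition kit.

    def identify_script(word_image) -> str:
        """Step 1: classify the script of a document/word image (stub)."""
        raise NotImplementedError("plug in a trained script classifier")

    OCR_BANK = {}  # one OCR engine per possible script, e.g. {"Arabic": arabic_ocr}

    def recognize(word_image) -> str:
        """Step 2: route the image to the OCR engine of its detected script."""
        script = identify_script(word_image)
        return OCR_BANK[script].recognize(word_image)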

The benchmarking works on script identification in the literature use different datasets with different script combinations, which makes a fair comparison of the different approaches difficult. Moreover, the databases employed in related studies usually include only two to four scripts; few include a higher number. To alleviate this drawback, in this competition we offer a database for script identification that consists of a wide variety of some of the most commonly used scripts, collected from real-life printed and handwritten documents. The printed documents in the database were obtained from local newspapers and magazines, and therefore comprise different fonts, sizes, and cursive and bold text. The handwritten part was obtained from volunteers from different parts of the world, who scanned and shared their manuscripts.

The detailed tasks for the competition are as follows:

Task 1. Script identification in handwritten documents

Task 2. Script identification in printed documents

Task 3. Mixed script identification: trained and tested with both handwritten and printed documents

Participants will perform word-level script recognition.

The submission that achieves the best average accuracy across handwritten, printed, cross, and mixed script recognition (regardless of the script/word/line label) will be declared the winner, as sketched below.
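A minimal sketch of this ranking score, assuming four per-task accuracies in [0, 1] (the dictionary keys are ours, not an official format):

    def ranking_score(accuracies: dict) -> float:
        """Mean recognition accuracy over the four evaluation settings."""
        return sum(accuracies.values()) / len(accuracies)

    # Example (made-up numbers):
    # ranking_score({"handwritten": 0.91, "printed": 0.97,
    #                "cross": 0.84, "mixed": 0.89})  -> 0.9025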


Organizers Information:

Abhijit Das (Indian Statistical Institute, Kolkata), email: abhijitdas2048@gmail.com

Umapada Pal (Indian Statistical Institute, Kolkata), email: umapada@isical.ac.in

Miguel A. Ferrer (Univ. de Las Palmas de Gran Canaria, Spain), email: miguelangel.ferrer@ulpgc.es

Moises Diaz (Universidad del Atlantico Medio, Spain), email: moises.diaz@atlanticomedio.es

Aythami Morales (Universidad Autónoma de Madrid, Spain), email: aythami.morales@uam.es

Donato Impedovo (Università degli Studi di Bari Aldo Moro, Italy), email: donato.impedovo@uniba.it

Schedule:

Different phases of the competition and their dates:

Site opens: 1st Oct 2020
Registration starts and train dataset available: 20th Nov 2020
Test dataset available: 11th April 2021
Registration closes: 27th April 2021
Algorithm submission deadline: 27th April 2021
Results and report announcement: 3rd May 2021


How to participate?

Registration for the competition can be done by email. If you would like to register and receive the training dataset, please send an email to abhijitdas2048@gmail.com with the subject line "SIW 2021 registration" and the following information:

Name,

Affiliation,

Email,

CV,

Signed version of the license form.


Description of the evaluation criteria (performance metrics) and available baseline implementations:

The algorithms submitted by the participants will be evaluated by the organizers. Participants need to submit the confidence scores of the script classification in a .csv file, along with their code. The evaluation measures will be recognition accuracy, equal error rate (EER), and F1 score. Evaluation will be performed against ground truth built from manually annotated and manually segmented word and line regions.
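For orientation only, here is a sketch of how the three measures could be computed from a submission using scikit-learn; the CSV layout (an image_id column, one confidence column per script, and a ground-truth label column) is our assumption, and the official procedure on CodaLab takes precedence:

    # Hedged sketch: accuracy, macro F1, and a one-vs-rest equal error rate
    # computed from per-script confidence scores. Column names are assumed.
    import numpy as np
    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score, roc_curve

    def equal_error_rate(is_target, scores):
        """EER for one script, one-vs-rest: the point where false-acceptance
        and false-rejection rates cross on the ROC curve."""
        fpr, tpr, _ = roc_curve(is_target, scores)
        fnr = 1.0 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2.0

    df = pd.read_csv("submission.csv")  # assumed layout, see above
    scripts = [c for c in df.columns if c not in ("image_id", "label")]
    y_true = df["label"].to_numpy()
    y_pred = df[scripts].idxmax(axis=1).to_numpy()

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))
    print("mean EER:", np.mean([equal_error_rate(y_true == s, df[s])
                                for s in scripts]))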

Details on the experimental protocol and the result generation/submission procedure are available at: https://competitions.codalab.org/competitions/27802


Benchmark datasets:

In this competition we introduce a database that contains both printed and handwritten documents obtained from a wide variety of scripts, such as Arabic, Bengali, Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu and Thai. The dataset consists of 1,137 documents scanned from local newspapers, as well as handwritten letters and notes. These documents are segmented into lines and words, for a total of 13,983 lines and 86,675 words. In addition, we showcase a benchmarking of the proposed dataset using methods based on classic texture features, namely Local Binary Patterns, Quad-Tree Histogram of Templates, and Dense Multi-Block Local Binary Patterns, as well as deep features obtained from two Deep Neural Network architectures. Furthermore, we include a discussion and analysis of the benchmarking results, which will facilitate the understanding of the dataset and is expected to elicit new developments and insights in this research area.
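To illustrate the classic texture baseline, here is a minimal sketch of word-level script classification with Local Binary Pattern histograms; the neighborhood parameters and the linear SVM are our assumptions, not the settings used in the benchmark:

    # Hedged sketch of an LBP texture baseline for script identification.
    import numpy as np
    from skimage.feature import local_binary_pattern
    from sklearn.svm import LinearSVC

    P, R = 8, 1  # assumed neighborhood: 8 sampling points at radius 1

    def lbp_histogram(gray_word_image: np.ndarray) -> np.ndarray:
        """Uniform-LBP histogram as a fixed-length texture descriptor."""
        codes = local_binary_pattern(gray_word_image, P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        return hist

    # With the released data: stack descriptors and fit any classifier, e.g.
    # X = np.stack([lbp_histogram(img) for img in train_word_images])
    # clf = LinearSVC().fit(X, train_script_labels)
    # predicted = clf.predict([lbp_histogram(test_word_image)])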