First draft by Birhanu Hailu Belay and Haozhe Sun
Contributors: Birhanu Hailu Belay, Haozhe Sun, Hui Zhang
OCR is a complex visual recognition and analysis system, which includes many sub-tasks. We summarize some of the common OCR datasets below:
MNIST Database of Handwritten digits
Link for dataset: http://yann.lecun.com/exdb/mnist/
Reference and Link: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
Date created : 1998
Comments:
This dataset is very popular in computer vision and image processing in general, and in digit recognition in particular.
Used for digit recognition.
This dataset has been used for benchmarking machine learning algorithms.
This dataset consists of isolated digit images, so it is limited to digit recognition, and the image size is fixed. In addition, the digits were written by a small group of people and show little variation.
Quantitative numbers:
For datasets:
Volume = 12 MB compressed file
Number of examples: 60,000 images for training and 10,000 images for testing
Number of classes or labels: 10 digits
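The MNIST files listed above are distributed as gzip-compressed IDX binaries. The following is a minimal sketch of how such a buffer could be read, assuming the standard IDX layout (two zero bytes, a one-byte type code, a one-byte dimension count, then one big-endian 32-bit size per dimension). The function name parse_idx is hypothetical, and only the unsigned-byte type used by MNIST is handled.

```python
import struct


def parse_idx(data: bytes):
    """Parse an IDX buffer into (shape, flat list of unsigned byte values).

    Assumption: the payload uses type code 0x08 (unsigned byte), as in MNIST.
    """
    # Header: two zero bytes, the element type code, and the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")

    # One big-endian unsigned 32-bit integer per dimension.
    header_end = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, data[4:header_end])

    # The payload length must equal the product of the dimensions.
    count = 1
    for dim in shape:
        count *= dim
    payload = data[header_end:]
    if len(payload) != count:
        raise ValueError("payload size does not match header")

    return shape, list(payload)
```

For a downloaded file, the bytes returned by gzip.open(path, "rb").read() could be passed to parse_idx; for the training images one would expect a shape of (60000, 28, 28), and for the training labels a shape of (60000,).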
IAM On-line Handwriting:
Link for dataset: https://fki.tic.heia-fr.ch/databases/iam-on-line-handwriting-database
Reference and Link: Liwicki, M. and Bunke, H.: IAM-OnDB - an On-Line English Sentence Database Acquired from Handwritten Text on a Whiteboard. 8th Intl. Conf. on Document Analysis and Recognition, 2005, Volume 2, pp. 956 - 961
Date created : 2005
Comments:
Contains forms of handwritten English text acquired on a whiteboard, and includes more than 1700 forms.
Used for online handwritten text recognition, writer identification, and writer verification.
It has been widely used by many researchers to date.
Quantitative numbers:
For datasets:
Volume = ~60 MB compressed data
Number of examples: 13,049 isolated and labeled text line images
Number of classes or labels: 58 characters for recognition and 221 writers for writer identification and verification
NIST Special Database 19:
Link for dataset: https://www.nist.gov/srd/nist-special-database-19
Reference and Link: Grother, Patrick, and Kayee Hanaoka. "Nist special database 19 handprinted forms and characters 2nd edition." National Institute of Standards and Technology, Tech. Rep 13 (2016).
Date created : 2016 (2nd edition)
Comments:
This dataset contains full-page binary images of 3,669 Handwriting Sample Forms and 814,255 segmented digit and alphabetic characters, organized to suit different recognition applications.
Used for character recognition.
This dataset offers great flexibility for experimenting with various tasks, since it is organized in several ways, e.g. by author, field, class, and page.
This dataset consists of segmented character and digit images that are annotated for character-level recognition.
Quantitative numbers:
For datasets:
Volume = 984 MB zip file
Number of examples: 814,255 segmented digit and alphabetic characters
Number of classes or labels: 62 characters
DIDA
Link for dataset: https://didadataset.github.io/DIDA/#digitnet-model-and-weights
Reference and Link: Huseyin Kusetogullari, Amir Yavariabdi, Johan Hall, Niklas Lavesson, "DIGITNET: A Deep Handwritten Digit Detection and Recognition Method Using a New Historical Handwritten Digit Dataset", Big Data Research, 2020.
Date created : 2020
Comments:
This dataset is presented by its authors as the largest historical handwritten digit dataset, and it is released together with the weights of trained models. In addition, the images are kept in their original size and appearance.
Used for historical handwritten digit recognition.
The data were collected from 18th and 19th century Swedish documents and can be used by the OCR community to test optical handwritten character recognition methods.
This dataset consists of isolated historical digit images, which might not be useful for word-level or text-line-level recognition tasks.
Quantitative numbers:
For datasets:
Volume = 861 MB compressed file
Number of examples: 250k digits
Number of classes or labels: 10 digits
COCO-Text Dataset
Link for dataset: https://vision.cornell.edu/se3/coco-text-2/
Reference and Link: Veit, Andreas; Matera, Tomas; Neumann, Lukas; Matas, Jiri; Belongie, Serge COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images arXiv preprint arXiv:1601.07140, 2016.
Date created : 2016 (v1)
Comments:
It is a large-scale dataset for text in natural images, and the first dataset to annotate scene text with attributes such as legibility and type of text.
Used for text detection and recognition in natural images.
Quantitative numbers:
For datasets:
Volume = 12 MB compressed file
Number of examples: 63,000 images and 145,000 text instances
Number of classes or labels: labels are grouped into 3 attributes (class, language, and legibility): 2 classes (machine printed or handwritten), 3 language values (English, not English, or N/A), and 2 legibility values (legible or illegible)
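The three label groups described for COCO-Text can be pictured as one small record per text instance. The sketch below is purely illustrative: the class names, field names, and string values are assumptions made for this example and do not reflect the schema of the actual annotation file.

```python
from dataclasses import dataclass
from enum import Enum


class TextClass(Enum):
    MACHINE_PRINTED = "machine printed"
    HANDWRITTEN = "handwritten"


class Language(Enum):
    ENGLISH = "english"
    NOT_ENGLISH = "not english"
    NA = "n/a"


class Legibility(Enum):
    LEGIBLE = "legible"
    ILLEGIBLE = "illegible"


@dataclass(frozen=True)
class TextInstance:
    """One annotated text region; field names are hypothetical."""
    text_class: TextClass
    language: Language
    legibility: Legibility


def legible_english(instances):
    """Keep only the instances that are both legible and in English."""
    return [
        inst for inst in instances
        if inst.legibility is Legibility.LEGIBLE
        and inst.language is Language.ENGLISH
    ]
```

A filter of this kind is one plausible way to select a subset for a recognition experiment, since illegible or non-English instances are often handled separately from the rest.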
The Street View Text Dataset
Link for dataset: http://www.iapr-tc11.org/mediawiki/index.php?title=The_Street_View_Text_Dataset
Reference and Link: Kai Wang, Boris Babenko and Serge Belongie, "End-to-end Scene Text Recognition", ICCV 2011
Date created : 2011 (v1)
Comments:
This dataset was harvested from Google Street View, from outdoor street-level signs and boards.
Used for word-level detection and recognition in natural images.
Image text in this dataset exhibits high variability and often has low resolution.
Quantitative numbers:
For datasets:
Volume = 118 MB compressed file
Number of examples: 725 labeled words
Number of classes or labels: 62 characters for the recognition task and a bounding box for each word for the detection task
TextOCR Dataset:
Link for dataset: https://textvqa.org/textocr/
Reference and Link: Sidorov, Oleksii, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. "Textcaps: a dataset for image captioning with reading comprehension." In European conference on computer vision, pp. 742-758. Springer, Cham, 2020.
Date created : 2021
Comments:
This dataset consists of arbitrary-shaped text in natural images, for the tasks of text detection and recognition.
It also includes images for visual question answering and image captioning tasks.
Quantitative numbers:
For datasets:
Volume = ~7GB compressed file
Number of examples: 903,069 labeled words
Number of classes or labels: a bounding box for each word for the detection task and all Latin characters for the recognition task.
CROHME: Online Handwritten Mathematical Expressions
Link for dataset: http://www.iapr-tc11.org/mediawiki/index.php?title=CROHME:_Competition_on_Recognition_of_Online_Handwritten_Mathematical_Expressions
Reference and Link: Mouchère H., Viard-Gaudin C., Zanibbi R., Garain U., Kim D. H., Kim J. H., "ICDAR 2013 CROHME: Third International Competition on Recognition of Online Handwritten Mathematical Expressions", International Conference on Document Analysis and Recognition, USA (2013)
Date created : 2013 (v2)
Comments:
Different devices were used to acquire the mathematical symbols, including digital pen technologies, whiteboard input devices, and tablets with touch-sensitive screens.
The samples in this dataset come in different scales and resolutions.
This dataset has been used in competitions and consists of 4 levels, covering from 41 to 101 symbols, with increasing difficulty in the grammar of allowed expressions.
Quantitative numbers:
For datasets:
Volume = ~130 MB compressed file
Number of examples: 10,000 expressions handwritten by hundreds of writers
Number of classes or labels: LaTeX ground truth with up to 101 symbols
NEOCR: Natural Environment OCR Dataset
Link for dataset: http://www.iapr-tc11.org/mediawiki/index.php?title=NEOCR:_Natural_Environment_OCR_Dataset
Reference and Link: R. Nagy, A. Dicker and K. Meyer Wegener, "NEOCR: A Configurable Dataset for Natural Image Text Recognition". In CBDAR Workshop 2011 at ICDAR 2011. pp. 53‐58, September 2011
Date created : 2011
Comments:
This dataset includes manually annotated information such as texture, noise, and inversion, as well as geometrical (e.g. distortion, rotation) and typographical (typeface and language) characteristics.
This dataset can be used for text detection and recognition tasks.
The images in this dataset are annotated at the word or phrase level, which makes the number of characters per bounding box higher.
The dataset covers 15 different languages written in Latin characters.
Quantitative numbers:
For datasets:
Volume = 1.3 GB compressed file
Number of examples: 659 real world images
Number of classes or labels: 5238 annotated bounding boxes and other metadata annotations
======================================================================================================================================================================
TBD.........
Common tasks and OCR datasets for benchmarking:
Text Image Preprocessing
  SISR
  TextZoom
Text Detection / Recognition
  Synth-90K
  Synth-Text
  SVT
  SVTP
  ICDAR-2013
  ICDAR-2015
  ICDAR-2017
  CUTE
  RCTW
  MLT-2017
  MLT-2019
  LSVT
  ArG
  SROIE
  MSRA-TD-500
  HUST-TR-400
  CTW-1500
  Total-Text
  IIIT5K
Reading Order Detection
  ReadingBank
Table Reconstruction
  PubTabNet
  SynthTabNet
  TableBank
Layout Analysis
  CDLA
  TableBank
  PubTabNet
Information Extraction from Documents
  FUNSD
  SROIE
  XFUND
Document Visual Question Answering
  Doc-VQA
  ST-VQA