Old OCRopus Wiki


Indic Scripts (Devanagari, Bengali, Kannada, etc.)

Common Issues

This page collects general information on how to handle Indic scripts.

Recognition of Indic scripts is complicated for several reasons:
  • Indic scripts treat vowels as diacritics, giving rise to large numbers of glyphs representing consonant/vowel combinations; these either need to be analyzed or trained as a whole.
  • Many Indic scripts have hundreds (sometimes thousands) of ligatures representing consonant clusters; these are not analyzable in general and need to be recognized as different glyphs.
  • Unicode encodes Indic scripts structurally: a sequence of code points (for example consonant, virama, vowel sign) combines into a single rendered glyph.
  • Indic scripts do not always indicate word breaks (for example, in traditionally written Sanskrit).
  • Some Indic scripts connect characters with a single line, making connected component analysis fail.
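The structural coding mentioned above can be inspected directly. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration. It lists the code points behind a consonant/vowel combination and behind a consonant cluster. Whether the cluster is drawn as a single ligature depends on the font.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Consonant KA followed by the vowel sign I: rendered as one unit, stored as two code points.
ki = "\u0915\u093f"

# KA + VIRAMA + SSA: usually rendered as one conjunct, stored as three code points.
kssa = "\u0915\u094d\u0937"

for label, text in (("ki", ki), ("kssa", kssa)):
    print(label, text)
    for codepoint, name in describe(text):
        print("   ", codepoint, name)
```

A recognizer that outputs Unicode therefore has to map each recognized glyph back to such a sequence, not to a single character.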
Here are examples of common Indic scripts (from Wikipedia):


Sanskrit in modern Indian scripts. May Śiva bless those who take delight in the language of the gods. (Kalidasa)

Transliteration Tools

Indic scripts are usually typed roughly phonetically in Latin script and then transliterated.  This can be done by the operating system (the SCIM subsystem on Linux), by the application, or over the web.  There are several different transliteration schemes, and some transliteration tools attempt to be smart about ambiguous input.
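To illustrate how such a transliteration works in principle, here is a minimal sketch for Devanagari. The Latin scheme is a toy invented for this example (it is not ITRANS or any other real scheme), it covers only a handful of letters, and the names `CONSONANTS`, `VOWELS`, `match_vowel` and `transliterate` are hypothetical. A consonant followed by a vowel gets the dependent vowel sign, and a consonant with no vowel after it gets a virama so that it joins the next consonant.

```python
# Toy Latin-to-Devanagari transliteration (illustrative scheme only).
CONSONANTS = {
    "k": "\u0915", "g": "\u0917", "t": "\u0924", "d": "\u0926",
    "n": "\u0928", "p": "\u092a", "m": "\u092e", "r": "\u0930",
    "l": "\u0932", "v": "\u0935", "s": "\u0938", "h": "\u0939",
}

# Latin vowel -> (independent letter, dependent vowel sign).
# The inherent vowel "a" has no sign after a consonant.
VOWELS = {
    "a": ("\u0905", ""),        "aa": ("\u0906", "\u093e"),
    "i": ("\u0907", "\u093f"),  "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"),  "uu": ("\u090a", "\u0942"),
    "e": ("\u090f", "\u0947"),  "o": ("\u0913", "\u094b"),
}

VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant

def match_vowel(text, i):
    """Return the longest vowel key starting at position i, or None."""
    for length in (2, 1):
        key = text[i:i + length]
        if key in VOWELS:
            return key
    return None

def transliterate(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            i += 1
            key = match_vowel(text, i)
            if key is None:
                out.append(VIRAMA)           # no vowel follows: bare consonant
            else:
                out.append(VOWELS[key][1])   # dependent vowel sign
                i += len(key)
        else:
            key = match_vowel(text, i)
            if key is not None:
                out.append(VOWELS[key][0])   # independent vowel letter
                i += len(key)
            else:
                out.append(ch)               # pass through anything unknown
                i += 1
    return "".join(out)

print(transliterate("namaste"))  # prints नमस्ते
```

Note that this sketch also writes a virama after a word-final consonant, which real schemes and real orthography often handle differently; that is one of the places where practical tools "attempt to be smart".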

This Wikipedia article contains a list of on-line transliteration resources for Indic scripts:

http://en.wikipedia.org/wiki/Devanagari_transliteration

These are on-line tools that perform transliteration right in your browser:

http://www.google.com/transliterate/indic
http://www.iit.edu/~laksvij/language/sanskrit.html
http://quillpad.in/hindi/

Character Segmentation

A prominent feature of many (but not all) Indic scripts is a headline that connects the characters (called the shirorekha in Devanagari and the matra in Bengali). The OCRopus curved cut segmenter seems to work well in the presence of these lines.

Another approach for dealing with these lines is to clip them between characters, which reduces the problem to one similar to segmenting other scripts. See here for sample code. If we adopted clipping as the approach, we would probably re-implement that code.
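The clipping idea can be sketched as follows. This is not the sample code referred to above and not part of OCRopus; it is a simplified illustration under stated assumptions: the input is a binarized text line given as a list of rows of 0/1 values (1 = ink), the headline is taken to be the row with the most ink, and a column counts as a gap between characters if it has no ink below the headline. The function name `clip_headline` and the `band` parameter are hypothetical.

```python
def clip_headline(line, band=0):
    """Erase the headline in columns that have no ink below it.

    line: list of rows, each a list of 0/1 values (1 = ink).
    band: extra rows above and below the detected headline row to treat
          as part of the headline (for thick lines).
    Returns a new image; the input is not modified.
    """
    height = len(line)
    width = len(line[0])

    # The headline is assumed to be the row with the most ink
    # (the peak of the horizontal projection profile).
    row_ink = [sum(row) for row in line]
    top = row_ink.index(max(row_ink))
    lo = max(0, top - band)
    hi = min(height, top + band + 1)

    out = [list(row) for row in line]
    for x in range(width):
        # A column with nothing below the headline lies between characters.
        has_body = any(line[y][x] for y in range(hi, height))
        if not has_body:
            for y in range(lo, hi):
                out[y][x] = 0
    return out

# Two character bodies hanging from one continuous headline (row 1).
image = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
for row in clip_headline(image):
    print("".join("#" if v else "." for v in row))
```

After clipping, the two character bodies are no longer joined, so ordinary connected component analysis can separate them. Real scans would need more care, for example for skewed lines or for strokes that sit above the headline.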

Here is an example of Devanagari segmentation; the top is the input image, the bottom is a color representation of the segmentation.  It's an oversegmentation, which is OK: that's what the OCRopus recognizer expects:



Devanagari (Hindi, Sanskrit)

Pavan Kumar N [BTech, NITK Surathkal]

Bengali

Note: see the pages on Indic Scripts as well. There is also some information on training Tesseract. One approach to segmenting scripts like Bengali is to cut the "matras" to obtain isolated characters (see here), although the regular OCRopus curved cut segmenter also appears to work well.

BOCRA Open Source OCR

There is a SourceForge project, BOCRA, which works only some of the time and has been neglected for a while. It has an interface for training which might be useful (and important for Bengali because fonts are not very standard).  It is unclear whether its recognition approach is any good.

BanglaOCR Open Source HMM-Based OCR

BanglaOCR is an HMM-based OCR system released under the GPL by the Center for Research on Bangla Language Processing.

Commercial Bengali OCR

None known. Several groups claim to have working implementations, but none are actually available, commercially or otherwise.

Ground-Truthed Bengali OCR Data (Scans + Transcription)

The Indian Statistical Institute (which claims to have an OCR system they plan to commercialize) has a project that plans to provide some data, but nothing is available on the website yet.

Bengali Dictionaries and Text Corpora (for Statistical Language Modeling)

Basic word lists are available (e.g. from http://www.bengalinux.org). There is also a website similar to Project Gutenberg (maintained by the author of BOCRA), which has some material.

Kannada

I am a newbie, 74 years old. I would like to know about OCR for Kannada (similar to Bengali), one of the Indian languages. How can I help to develop OCR for Kannada?

Telugu