Common Issues

This wiki has some general information on how to handle Indic scripts.

Generally, recognition of Indic scripts is complicated because...
Sanskrit in modern Indian scripts. May Śiva bless those who take delight in the language of the gods. (Kalidasa)

Transliteration Tools

Indic scripts are usually typed roughly phonetically in Latin script and then transliterated. This can be done by the operating system (for example, the SCIM subsystem on Linux), by the application, or over the web. Several different transliteration systems exist, and some transliteration tools attempt to be smart.

This Wikipedia article contains a list of online transliteration resources for Indic scripts:

http://en.wikipedia.org/wiki/Devanagari_transliteration

These online tools perform transliteration right in your browser:

http://www.google.com/transliterate/indic
http://www.iit.edu/~laksvij/language/sanskrit.html
http://quillpad.in/hindi/

Character Segmentation

A prominent feature of many (but not all) Indic scripts is a line that connects all the characters. The OCRopus curved cut segmenter seems to work well in the presence of these lines.

However, another approach for dealing with these lines is to clip them between characters, which turns the problem into one similar to that of other scripts. See here for sample code. If we wanted to use clipping as the approach, we would probably re-implement that.

Here is an example of Devanagari segmentation; the top is the input image, the bottom is a color representation of the segmentation. It is an oversegmentation, which is OK: that is what the OCRopus recognizer expects.

Devanagari (Hindi, Sanskrit)

Pavan Kumar N [Btech, NITK Surathkal]

Bengali
Note: see the pages on Indic Scripts as well. There is also some information on training Tesseract.

One approach to segmenting scripts like Bengali is to cut the "matras" to achieve isolated characters (see here), although the regular OCRopus curved cut segmenter also appears to work well.
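
As a rough illustration of the matra-cutting idea, the following Python sketch finds the headline as the densest band of rows in a binarized text-line image and erases it in every column that contains no other ink. This is not the sample code linked above and it is not part of OCRopus; the function names (find_headline, clip_headline, column_segments), the min_fill threshold, and the toy image are all assumptions made for illustration. It also assumes a non-empty, deskewed, single-line image with 1 for ink and 0 for background.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels: 1 = ink, 0 = background


def find_headline(img: Image, min_fill: float = 0.8) -> Tuple[int, int]:
    """Return (top, bottom), inclusive, of the densest band of rows.

    Assumption: the connecting line (matra) is the row with the most ink,
    plus neighbouring rows holding at least min_fill of that amount.
    """
    row_ink = [sum(row) for row in img]
    peak = max(range(len(img)), key=lambda r: row_ink[r])
    threshold = min_fill * row_ink[peak]
    top = peak
    while top > 0 and row_ink[top - 1] >= threshold:
        top -= 1
    bottom = peak
    while bottom < len(img) - 1 and row_ink[bottom + 1] >= threshold:
        bottom += 1
    return top, bottom


def clip_headline(img: Image, min_fill: float = 0.8) -> Image:
    """Erase the headline in columns that have no ink outside the headline."""
    top, bottom = find_headline(img, min_fill)
    out = [row[:] for row in img]  # copy so the input is left untouched
    for x in range(len(img[0])):
        ink_outside = any(
            img[y][x] for y in range(len(img)) if y < top or y > bottom
        )
        if not ink_outside:  # only the headline crosses this column
            for y in range(top, bottom + 1):
                out[y][x] = 0
    return out


def column_segments(img: Image) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) column ranges that contain any ink."""
    segments = []
    start = None
    for x in range(len(img[0])):
        has_ink = any(row[x] for row in img)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append((start, x - 1))
            start = None
    if start is not None:
        segments.append((start, len(img[0]) - 1))
    return segments


# Toy example: two glyphs joined by a one-pixel headline in the top row.
rows = [
    "111111111",
    "010000100",
    "010000100",
    "111000111",
]
glyphs = [[int(c) for c in row] for row in rows]

print(column_segments(glyphs))                 # [(0, 8)]
print(column_segments(clip_headline(glyphs)))  # [(0, 2), (6, 8)]
```

Before clipping, the headline makes the whole line a single connected run of columns; afterwards the two glyphs separate. Real scans would likely need a more robust headline detector, and touching characters would still require a segmenter such as the curved cut approach.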


