Old OCRopus Wiki


Indic Scripts (Complex Scripts with Structural Coding)

Scripts like Devanagari have a complex mapping from Unicode character sequences to glyphs (and the same is true for older coding systems like ISCII). For example, the isolated characters representing the sequence "ha i na da i"

ह इ न ड इ

render as "hindi":

हिन्डि

(Note that the name of the language Hindi is actually written as हिन्दी; it uses a different "d" sound, द rather than ड, and a long "ii" sound.)
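
To see what such a string looks like underneath the rendering, the short sketch below lists the code points of हिन्डि. It assumes the word is stored in standard Unicode logical order (with dependent vowel signs and a virama rather than the isolated letters shown above); `describe` is a hypothetical helper name, not part of OCRopus.

```python
import unicodedata

def describe(word):
    """Return (code point, Unicode name) pairs for each character of word."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

# हिन्डि written with escapes so the stored sequence is unambiguous
word = "\u0939\u093f\u0928\u094d\u0921\u093f"

for code, name in describe(word):
    print(code, name)
# U+0939 DEVANAGARI LETTER HA
# U+093F DEVANAGARI VOWEL SIGN I
# U+0928 DEVANAGARI LETTER NA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0921 DEVANAGARI LETTER DDA
# U+093F DEVANAGARI VOWEL SIGN I
```

Six code points are stored, and the vowel sign "i" comes after its consonant in memory even though it is drawn to the left of it.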

Similar phenomena are widespread in most of the Indic scripts, as well as in Urdu and Persian. In Latin script, there are few ligatures, and the ones that actually exist (e.g., "et"/"&", "sz"/"ß", "ffl"/"ﬄ") have individual Unicode code points. CJK languages would have been prime candidates for structural coding, but various cultural and political obstacles prevented that, so CJK languages have tens of thousands of Unicode code points instead.

As a result, you cannot simply train on a Unicode transcription and expect the recognizer to work. How you translate between Unicode and the character classes of the character recognizer affects how well the overall system works.

Here is an example of the kinds of decisions you may need to make:
  • consider the string हिन्डि ("hindi"); this string consists of the vowel "i", the consonant "ha", the vowel "i", and the consonant "nda" (which is a ligature of the consonants "na" and "da")
  • let's assume that the segmenter returns at least four segments, the two halves of हि and the two halves of न्डि
  • here are examples of how this input might be transcribed under glyph transcription systems that actually allow the character recognizer to learn the shapes:
    • हि न्डि ("hi ndi")
      • it's probably the simpler and more reliable way of dealing with this
      • segmenters will almost always split the इ and the ह from each other, but that is OK; the same happens in Latin script (e.g., "m" also gets split into "rn", and even into "ni" and "rri")
      • if you choose this encoding, you need to define your own code points for the glyph हि and the glyph न्डि (since Unicode doesn't do it for you) and use those as the classes for the classifier
    • इ ह इ न्ड ("i ha i nda")
      • the system learns separate character models for "i", "ha", and "nda".
      • this reduces the number of models, but complicates the glyph-to-character correspondences
  • here are transcriptions that likely won't work for training:
    • ह इ न्ड इ ("ha i nda i")
      • the transcription does not correspond to the actual linear arrangement of glyphs; in reading order, the pixels representing the vowel "i" really do come before the pixels denoting the consonant "ha".
    • ह इ न ड इ
      • if your segmenter can't segment the glyph for न्ड into न and ड (and it probably can't in general), then you can't train isolated shapes for it
  • transcriptions for training other kinds of vowels would be different, since it is the glyph sequence as generated by the segmentation that matters
    • होन्ड ("honda") 
      • good: हो न्ड ("ho nda") or ह ओ न्ड ("ha o nda")
      • bad: ओ ह न्ड ("o ha nda"), since, unlike "i", for "o", the glyph follows the consonant
    • हेन्ड ("henda")
      • good: हे न्ड ("he nda") or ह न्ड ("ha nda", putting the diacritic back later)
      • bad: ह ए न्ड ("ha e nda"), since there is no separate glyph corresponding to the "e" vowel (only a diacritic)
  • there is an additional complication for Indic scripts: there is no single code point for न्ड in Unicode, so the character recognizer can't be trained on that as a class
    • right now, your best bet is to map the Unicode ground truth into the ASCII range somehow yourself, assigning codepoints to ligatures, and then train that way
    • in the future, we'll have a cleaner, more general solution
These issues are all connected, since the primitives you use in the transcription determine what the character recognizer has to learn.
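
As a rough illustration of the "i ha i nda" style of transcription combined with self-assigned code points, here is a minimal sketch. Everything in it is an assumption made for illustration: the names `to_visual_tokens` and `GlyphCodec`, the choice of which vowel signs count as separate glyphs, and the use of the Unicode Private Use Area (starting at U+E000) for the per-glyph code points. It only handles basic consonants, viramas, and vowel signs in well-formed text.

```python
import re

VIRAMA = "\u094d"
PRE_BASE = {"\u093f"}  # vowel sign I, drawn to the left of the consonant
# Vowel signs treated here as separate glyphs that follow the consonant:
# AA, II, CANDRA O, SHORT O, O, AU (an assumption; adjust to your segmenter).
POST_BASE = set("\u093e\u0940\u0949\u094a\u094b\u094c")
CONS = "[\u0915-\u0939]"   # KA .. HA
MATRA = "[\u093e-\u094c]"  # dependent vowel signs

# A consonant cluster with an optional vowel sign, or any other single character.
CLUSTER = re.compile(f"({CONS}(?:{VIRAMA}{CONS})*)({MATRA}?)|(.)", re.DOTALL)

def to_visual_tokens(text):
    """Split logical-order Unicode into glyph tokens in left-to-right visual order."""
    tokens = []
    for m in CLUSTER.finditer(text):
        base, matra, other = m.group(1), m.group(2), m.group(3)
        if other is not None:
            tokens.append(other)
        elif matra in PRE_BASE:
            tokens += [matra, base]      # "i" is drawn before the consonant
        elif matra in POST_BASE:
            tokens += [base, matra]      # e.g. "o" is drawn after the consonant
        else:
            tokens.append(base + matra)  # diacritic (or no vowel sign) stays attached
    return tokens

class GlyphCodec:
    """Maps glyph tokens to private-use code points and back (illustrative only)."""

    def __init__(self, corpus, first_code=0xE000):
        glyphs = sorted({t for line in corpus for t in to_visual_tokens(line)})
        self.to_code = {g: chr(first_code + i) for i, g in enumerate(glyphs)}
        self.to_glyph = {c: g for g, c in self.to_code.items()}

    def encode(self, text):
        # Raises KeyError for glyphs that were not in the corpus.
        return "".join(self.to_code[t] for t in to_visual_tokens(text))

    def decode(self, glyphstring):
        out, pending = [], ""
        for code in glyphstring:
            glyph = self.to_glyph[code]
            if glyph in PRE_BASE:
                out.append(pending)          # flush a stray earlier vowel sign, if any
                pending = glyph              # hold until the following cluster
            else:
                out.append(glyph + pending)  # restore logical order
                pending = ""
        return "".join(out) + pending

hindi = "\u0939\u093f\u0928\u094d\u0921\u093f"  # हिन्डि
honda = "\u0939\u094b\u0928\u094d\u0921"        # होन्ड
henda = "\u0939\u0947\u0928\u094d\u0921"        # हेन्ड
codec = GlyphCodec([hindi, honda, henda])
encoded = codec.encode(hindi)                   # one code point per glyph
print(len(encoded), codec.decode(encoded) == hindi)
```

With these choices, हिन्डि becomes four tokens ("i", "ha", "i", "nda"), होन्ड becomes "ha", "o", "nda", and हेन्ड becomes "he", "nda", matching the transcriptions listed as workable above. Switching to the "hi ndi" scheme would only require keeping every vowel sign attached to its cluster.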

If you're faced with a complex script like Devanagari, the recommendation is:
  • first, try to figure out how well different segmenters work and what glyphs they give you
  • try to estimate the size of the glyph set and the frequency of each glyph for the segmenters
    • if the set of common glyphs is greater than about 200, then you may have to modify the glyph recognizer (not just train it) to deal with large sets of glyphs (this is a modification we plan on making standard down the road)
  • try to see what kinds of error rates you get for your glyphs if you train the isolated character recognizer on isolated input characters (there may be a pre-segmented data set you can use, or you can generate one); you can do most of the tuning on the isolated character recognizer
    • change the number of classes
    • change the resolution of the input feature map
    • change the feature types
    • etc.
  • only then worry about putting everything together into a line recognizer
For Indic scripts, the best solution is probably to treat each syllable (consonant cluster, vowel, other diacritics) as a single glyph and train on those. This will result in thousands of distinct glyphs, but otherwise keeps the system pretty simple.
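
To get a first estimate of how large such a syllable-level glyph set would be, and how skewed its frequencies are, something like the following sketch can be run over a ground-truth text corpus. The regular expression is a simplified assumption about what counts as a syllable (it ignores nukta and other less common signs), and `syllable_counts` and `coverage` are hypothetical helper names. It counts glyphs in the text, not in the segmenter output, so it is only a proxy for the segmenter-based estimate recommended above.

```python
import re
from collections import Counter

SYLLABLE = re.compile(
    "[\u0915-\u0939](?:\u094d[\u0915-\u0939])*"  # consonants joined by virama
    "[\u093e-\u094c]?"                           # optional dependent vowel sign
    "[\u0901-\u0903]?"                           # optional candrabindu/anusvara/visarga
    "|[\u0904-\u0914][\u0901-\u0903]?"           # or an independent vowel
)

def syllable_counts(lines):
    """Count how often each syllable-level glyph occurs in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(SYLLABLE.findall(line))
    return counts

def coverage(counts, top_n):
    """Fraction of all glyph occurrences covered by the top_n most frequent glyphs."""
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total if total else 0.0

# Tiny stand-in corpus: हिन्डि होन्ड / हेन्ड हिन्डि
corpus = [
    "\u0939\u093f\u0928\u094d\u0921\u093f \u0939\u094b\u0928\u094d\u0921",
    "\u0939\u0947\u0928\u094d\u0921 \u0939\u093f\u0928\u094d\u0921\u093f",
]
counts = syllable_counts(corpus)
print(len(counts), "distinct glyphs;",
      "more than 200" if len(counts) > 200 else "at most 200")
print("coverage of the 3 most frequent glyphs:", coverage(counts, 3))
```

On a real corpus, the number of distinct glyphs and the coverage curve indicate whether the roughly 200-class limit mentioned above is likely to be a problem.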

Comments (1)

Thomas Breuel - Jul 18, 2008 4:09 AM

From the mailing list (17 Jul 2008):

These issues are discussed at length in the documentation, under Indic
scripts:

http://sites.google.com/site/ocropus/documentation/indic-scripts

http://sites.google.com/site/ocropus/documentation/training-on-a-new-...

> We analyzed the source code in grouping.cc and find that in the
> function *addTrainingLine(intarray &trueseg, bytearray &image, nustring
> &chars) and addTrainingChar(subimage, char_text)* the classifier is
> accepting only one character at a time. So, there is nothing to do by us
> at this point. However it is very much necessary for us at this moment to
> train the entire character set in order to observe the performance of
> OCRopus for the Indic scripts. So, I think there should be a new version
> of the training procedure for scripts like Bangla, Devanagari and other
> scripts which have the same property.

Unfortunately, it's not that simple. There are many possible mappings
between Unicode strings and Devanagari glyphs. The addTrainingLine method
doesn't know which one you want.

For example, in many fonts, there is no separate "nda" glyph; there is a
generic "attached n" glyph (different from the "na" glyph) that then gets
attached to a "da" to form "nda". That might be a reasonable choice for
OCR, or you might choose to actually represent "nda" as a separate glyph.
For "di", the glyph sequence might be "i d" or it might be "skip-bar d
diacritic-i", or even something yet different.

You need to write a function that maps regular Unicode into strings that
have one codepoint per glyph (that is, the kind of glyph you have chosen to
represent characters), and another function that inverts this. Then, you
can map all your input strings into this one-codepoint-per-glyph
representation, train OCRopus using those strings, and at the end map
OCRopus output back to standard Unicode.

> Hope this issue will be considered as an important issue to solve
> immediately.

Well, as the wiki pages show, we have thought about it and there really
isn't much we need to add at this point; for now,

- implement devanagari_to_glyphstring and glyphstring_to_devanagari
functions and bind them to Lua
- modify the top level Lua training and recognition scripts to call those
functions at the right times

Once that code exists, then we'll add some hooks so that the to_glyphstring
and from_glyphstring functions get invoked at the right times inside
OCRopus, but that's just a small convenience for maintainers (it means we
don't need a separate top level loop for every language); you don't need
those for what you want to do.

Note that if you train bpnet on all Devanagari consonant clusters, it will
probably do something reasonable with the glyphs, but it will be slow; there
will be additional work afterwards tuning and improving the bpnet code to
work on such glyph sets.

Cheers,
Thomas.