Scripts like Devanagari have complex mapping of Unicode character sequences to glyphs (and the same is true for older coding systems like ISCII). For example, the isolated characters representing the sequence "ha i na da i"
ह इ न ड इ
render as "hindi":
हिन्डि
(Note that the name of the language Hindi is actually written as हिन्दी; it uses a different "d" sound, द vs. ड, and a long "ii" vowel.)
Similar phenomena are widespread in most of the Indic scripts, as well as in Urdu and Persian. In Latin script, there are few ligatures, and the ones that actually exist (e.g., "et"/"&", "sz"/"ß", "ffl"/"ﬄ") have individual Unicode code points. CJK languages would have been prime candidates for structural coding, but various cultural and political obstacles prevented that, so CJK languages have tens of thousands of Unicode code points instead.
As a result, you cannot simply train on a Unicode transcription and expect it to work. How you translate between Unicode and character classes in the character recognizer affects how well the overall system works.
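To make the mismatch concrete, the following minimal Python sketch lists the code points actually stored for the "hindi" example string. It uses only the standard `unicodedata` module; the helper name `describe` is made up for this illustration. Note that the stored string uses the dependent vowel sign (U+093F) rather than the isolated letter इ, and that the vowel sign is stored after its consonant even though it is drawn before it.

```python
import unicodedata

# The "hindi" example string, written with escapes so the code points
# are unambiguous regardless of editor or font.
text = "\u0939\u093f\u0928\u094d\u0921\u093f"

def describe(s):
    """Return (code point, Unicode name) for each character, in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

for code, name in describe(text):
    print(code, name)
# U+0939 DEVANAGARI LETTER HA
# U+093F DEVANAGARI VOWEL SIGN I
# U+0928 DEVANAGARI LETTER NA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0921 DEVANAGARI LETTER DDA
# U+093F DEVANAGARI VOWEL SIGN I
```

Six code points are stored, but the rendered line shows roughly four shapes in a different order, which is why the transcription has to be chosen deliberately.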
Here is an example of the kinds of decisions you may need to make:
- consider the string हिन्डि ("hindi"); in left-to-right visual order, this string consists of the vowel "i", the consonant "ha", the vowel "i", and the consonant cluster "nda" (which is a ligature of the consonants "na" and "da")
- let's assume that the segmenter returns at least four segments, the two halves of हि and the two halves of न्डि
- here are examples of how this input might be transcribed under glyph transcription systems that actually allow the character recognizer to learn the shapes:
  - हि न्डि ("hi ndi")
    - this is probably the simpler and more reliable way of dealing with the string
    - segmenters will almost always split the इ and the ह from each other, but that is OK; the same happens in Latin script (e.g., "m" routinely gets split into "rn", and even "ni" and "rri")
    - if you choose this encoding, you need to define your own code points for the glyph हि and the glyph न्डि (since Unicode doesn't do it for you) and use those as the classes for the classifier
  - इ ह इ न्ड ("i ha i nda")
    - the system learns separate character models for "i", "ha", and "nda"
    - this reduces the number of models, but complicates the glyph-to-character correspondences
- here are transcriptions that likely won't work for training:
  - ह इ न्ड इ ("ha i nda i")
    - the transcription does not correspond to the actual linear arrangement of glyphs; in reading order, the pixels representing the vowel "i" really do come before the pixels denoting the consonant "ha"
  - ह इ न ड इ
    - if your segmenter can't segment the glyph for न्ड into न and ड (and it probably can't in general), then you can't train isolated shapes for it
- transcriptions for training other kinds of vowels would be different, since it is the glyph sequence as generated by the segmentation that matters
  - होन्ड ("honda")
    - good: हो न्ड ("ho nda") or ह ओ न्ड ("ha o nda")
    - bad: ओ ह न्ड ("o ha nda"), since, unlike "i", for "o", the glyph follows the consonant
  - हेन्ड ("henda")
    - good: हे न्ड ("he nda") or ह न्ड ("ha nda", putting the diacritic back later)
    - bad: ह ए न्ड ("ha e nda"), since there is no separate glyph corresponding to the "e" vowel (only a diacritic)
- there is an additional complication for Indic scripts: there is no single code point for न्ड in Unicode, so the character recognizer can't be trained on that as a class
  - right now, your best bet is to map the Unicode ground truth into the ASCII range somehow yourself, assigning codepoints to ligatures, and then train that way
  - in the future, we'll have a cleaner, more general solution
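The "i ha i nda" style of transcription in the list above can be approximated mechanically. The sketch below is a hypothetical helper, not OCRopus code: it splits a Unicode string into glyph tokens in left-to-right drawing order. It assumes that every virama-joined consonant sequence is drawn as a single ligature (which depends on the font) and that the vowel sign "i" (U+093F) is the only mark drawn to the left of its cluster; nukta forms and other reordering cases are ignored.

```python
VIRAMA = "\u094d"        # joins consonants into a conjunct
VOWEL_SIGN_I = "\u093f"  # drawn to the left of its consonant cluster

def is_consonant(ch):
    # Basic Devanagari consonants KA..HA; extensions are ignored in this sketch.
    return "\u0915" <= ch <= "\u0939"

def to_visual_glyphs(text):
    """Split text into glyph tokens in left-to-right drawing order."""
    glyphs = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_consonant(ch):
            cluster = ch
            i += 1
            # Absorb virama + consonant pairs: na + virama + dda -> one token.
            while (i + 1 < len(text) and text[i] == VIRAMA
                   and is_consonant(text[i + 1])):
                cluster += text[i:i + 2]
                i += 2
            if i < len(text) and text[i] == VOWEL_SIGN_I:
                glyphs.append(VOWEL_SIGN_I)  # stored after the cluster, drawn before it
                i += 1
            glyphs.append(cluster)
        else:
            # Vowel signs such as "o" follow their consonant, so order is kept.
            glyphs.append(ch)
            i += 1
    return glyphs
```

For the "hindi" string this yields the tokens for "i", "ha", "i", "nda", and for "honda" it yields "ha", "o", "nda", matching the transcriptions marked as workable above. The tokens use the dependent vowel signs rather than the isolated vowel letters shown in the list.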
These issues are all connected, since which primitives you use in the transcription affects what the character recognizer has to learn.
If you're faced with a complex script like Devanagari, the recommendation is:
- first, try to figure out how well different segmenters work and what glyphs they give you
- try to estimate the size of the glyph set and the frequency of each glyph for the segmenters
  - if the set of common glyphs is greater than about 200, then you may have to modify the glyph recognizer (not just train it) to deal with large sets of glyphs (this is a modification we plan on making standard down the road)
- try to see what kinds of error rates you get for your glyphs if you train the isolated character recognizer on isolated input characters (there may be a pre-segmented data set you can use, or you can generate one); you can do most of the tuning on the isolated character recognizer
  - change the number of classes
  - change the resolution of the input feature map
  - change the feature types
  - etc.
- only then worry about putting everything together into a line recognizer
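Estimating the glyph set size and glyph frequencies can be done with a few lines of Python once you have glyph-level transcriptions. This is a rough sketch under assumptions: the function name `glyph_statistics`, the input format (one list of glyph tokens per text line), and the `min_count` cutoff for what counts as "common" are all choices made for this illustration.

```python
from collections import Counter

def glyph_statistics(transcriptions, min_count=5):
    """Count glyph classes over glyph-level transcriptions.

    transcriptions: iterable of lists of glyph tokens, one list per text line.
    Returns (counts, common), where common lists the glyphs seen at least
    min_count times, most frequent first.
    """
    counts = Counter(g for line in transcriptions for g in line)
    common = [g for g, n in counts.most_common() if n >= min_count]
    return counts, common
```

If `len(common)` comes out well above the figure of about 200 mentioned in the list, that is a hint that the recognizer itself may need changes and not just training.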
For Indic scripts, the best solution is probably to treat each syllable (consonant cluster, vowel, other diacritics) as a single glyph and train on those. This will result in thousands of distinct glyphs, but otherwise keeps the system pretty simple.
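A simplified way to cut a Devanagari string into such syllable units is a regular expression over the Unicode block. The pattern below is a sketch, not a complete implementation of Indic syllable rules: it covers basic consonants (optionally with nukta), virama-joined clusters, independent vowels, and trailing vowel signs and marks, and it treats everything else as a single-character unit. The name `split_syllables` is made up for this example.

```python
import re

CONSONANT = "[\u0915-\u0939]\u093c?"   # KA..HA, optionally followed by nukta
VIRAMA = "\u094d"
# Candrabindu/anusvara/visarga and the dependent vowel signs.
MARKS = "[\u0900-\u0903\u093a\u093b\u093e-\u094c]"

SYLLABLE = re.compile(
    f"{CONSONANT}(?:{VIRAMA}{CONSONANT})*{VIRAMA}?{MARKS}*"  # consonant cluster + marks
    f"|[\u0904-\u0914]{MARKS}*"                               # independent vowel + marks
    "|.",                                                     # anything else, one char
    re.DOTALL,
)

def split_syllables(text):
    """Return the list of syllable-sized units found in text, in storage order."""
    return SYLLABLE.findall(text)
```

Applied to the "hindi" example, this returns two units, हि and न्डि, which is the "hi ndi" transcription discussed earlier.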
From the mailing list (17 Jul 2008):
These issues are discussed at length in the documentation, under Indic
scripts:
http://sites.google.com/site/ocropus/documentation/indic-scripts
http://sites.google.com/site/ocropus/documentation/training-on-a-new-...
> We analyzed the source code in grouping.cc and find that in the
> function *addTrainingLine(intarray
> &trueseg, bytearray &image, nustring &chars) and addTrainingChar(subimage,
> char_text)* the classifier is accepting only one character at a time. So,
> there is nothing to do by us at this point. However it is very much
> necessary for us at this moment to train the entire character set in order
> to observe the performance of OCRopus for the Indic scripts. So, I think
> there should be a new version of the training procedure for scripts like
> Bangla, Devanagari and other scripts which have the same property.
Unfortunately, it's not that simple. There are many possible mappings
between Unicode strings and Devanagari glyphs. The addTrainingLine method
doesn't know which one you want.
For example, in many fonts, there is no separate "nda" glyph; instead, there
is a generic "attached n" glyph (different from the "na" glyph) that then gets
attached to a "da" to form "nda". That might be a reasonable choice for
OCR, or you might choose to actually represent "nda" as a separate glyph.
For "di", the glyph sequence might be "i d" or it might be "skip-bar d
diacritic-i", or even something else again.
You need to write a function that maps regular Unicode into strings that
have one codepoint per glyph (that is, the kind of glyph you have chosen to
represent characters), and another function that inverts this. Then, you
can map all your input strings into this one-codepoint-per-glyph
representation, train OCRopus using those strings, and at the end map
OCRopus output back to standard Unicode.
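As a rough illustration of the two mapping functions described above, here is a Python sketch (not OCRopus code; the class name `GlyphCodec` is hypothetical). It assumes the text has already been split into glyph tokens of whatever kind you chose, and it assigns code points from the Unicode Private Use Area starting at U+E000, which is an assumption of this example and not something OCRopus prescribes.

```python
class GlyphCodec:
    """Assigns one Private Use Area code point to each distinct glyph token."""

    def __init__(self, first_code=0xE000):
        self.first_code = first_code
        self.glyph_to_char = {}
        self.char_to_glyph = {}

    def encode(self, glyphs):
        """Map a list of glyph tokens to a string with one code point per glyph."""
        out = []
        for g in glyphs:
            if g not in self.glyph_to_char:
                # First time this glyph is seen: give it the next free code point.
                c = chr(self.first_code + len(self.glyph_to_char))
                self.glyph_to_char[g] = c
                self.char_to_glyph[c] = g
            out.append(self.glyph_to_char[g])
        return "".join(out)

    def decode(self, glyphstring):
        """Invert encode(): expand each code point back to its glyph token."""
        return "".join(self.char_to_glyph[c] for c in glyphstring)
```

If the tokens are whole syllables in storage order, `decode` gives back standard Unicode directly. If the tokens are in visual order, an extra reordering step is needed after decoding. The table built during training has to be saved and reused at recognition time.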
> Hope this issue will be considered as an important issue to solve
> immediately.
Well, as the wiki pages show, we have thought about it and there really
isn't much we need to add at this point; for now,
- implement devanagari_to_glyphstring and glyphstring_to_devanagari
functions and bind them to Lua
- modify the top level Lua training and recognition scripts to call those
functions at the right times
Once that code exists, then we'll add some hooks so that the to_glyphstring
and from_glyphstring functions get invoked at the right times inside
OCRopus, but that's just a small convenience for maintainers (it means we
don't need a separate top level loop for every language); you don't need
those for what you want to do.
Note that if you train bpnet on all Devanagari consonant clusters, it will
probably do something reasonable with the glyphs, but it will be slow; there
will be additional work afterwards tuning and improving the bpnet code to
work on such glyph sets.
Cheers,
Thomas.