AutoTag
Code and information to implement AutoTag in Matlab
Based on theory and code by Dr Nishatul Majid
"Character spotting and autonomous tagging: offline handwriting recognition for Bangla, Korean and other alphabetic scripts," Nishatul Majid, Elisa H Barney Smith, December 2022, International Journal on Document Analysis and Recognition (IJDAR), Springer Berlin Heidelberg, Volume 25, Issue 4, Pages 245-263.
Designed to be used to develop training sets for Character Spotting:
"Segmentation-free Bangla offline handwriting recognition using sequential detection of characters and diacritics with a Faster R-CNN," Nishatul Majid, Elisa H Barney Smith, September 2020, 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE, Pages 228-233.
Summary:
For each image file:
- Read the page/word image and the corresponding list of text words
- Find the location of each word in the page/word image (word segmentation)
- For each word:
  - Split into subWord components (letters in English and other European scripts, conjunct groups with a diacritic in Indic scripts, or individual Jamos in Korean)
  - Estimate the relative width of each component character (get_string_widths & strPixelWidthCalculator)
  - Convert the widths into bounding boxes (get_letter_positions)
  - Add buffers (get_letter_positions)
  - Determine the component code index number
    - Add new codewords as necessary
- Save
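
The codeword step in the summary above (look up a component's index, adding a new entry when it has not been seen before) could be sketched as follows. This is only an illustration: the function name `lookup_codeword` and the use of a `containers.Map` as the codebook are assumptions and are not taken from the AutoTag code.

```matlab
function [idx, codebook] = lookup_codeword(codebook, component)
% LOOKUP_CODEWORD  Hypothetical helper, not part of the AutoTag code.
%   codebook  : containers.Map from component string to index (double)
%   component : char vector holding one subWord component
%   idx       : index of the component in the codebook
%
% Example:
%   cb = containers.Map('KeyType', 'char', 'ValueType', 'double');
%   [i1, cb] = lookup_codeword(cb, 'a');   % new codeword
%   [i2, cb] = lookup_codeword(cb, 'b');   % new codeword
%   [i3, cb] = lookup_codeword(cb, 'a');   % already known, same index as i1
    if ~isKey(codebook, component)
        % Unseen component: assign it the next free index.
        codebook(component) = double(codebook.Count) + 1;
    end
    idx = codebook(component);
end
```

Indices are assigned in order of first appearance, so the same codebook has to be carried across all words and image files for the indices to stay consistent.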
Calculate Letter Widths
strPixelWidthCalculator
Render the character text as an image and measure its width in pixels.
Each letter group of a word is passed in, and the measured widths are then normalized to give relative widths.
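
As a rough sketch of the measure-and-normalize idea, the function below takes images that have already been rendered, one per letter group, measures the horizontal extent of the ink in each, and normalizes the results. It is not the strPixelWidthCalculator implementation. The name `relative_ink_widths`, the ink threshold of 128 (which assumes dark text on a light background with 0 to 255 intensities), and normalizing so the widths sum to 1 are all assumptions made for illustration.

```matlab
function rel = relative_ink_widths(glyphImages)
% RELATIVE_INK_WIDTHS  Hypothetical helper, not the AutoTag implementation.
%   glyphImages : cell array of 2-D grayscale images, one per letter group,
%                 each showing dark text on a light background (0..255)
%   rel         : row vector of relative widths that sums to 1
    n = numel(glyphImages);
    px = zeros(1, n);
    for k = 1:n
        % Columns that contain at least one dark (ink) pixel.
        inkCols = find(any(glyphImages{k} < 128, 1));
        if ~isempty(inkCols)
            % Width from the first to the last inked column, inclusive.
            px(k) = inkCols(end) - inkCols(1) + 1;
        end
    end
    if sum(px) == 0
        % Nothing was rendered: fall back to equal shares (assumption).
        rel = ones(1, n) / n;
    else
        rel = px / sum(px);
    end
end
```

Because only the ratios between the widths are used, the font size chosen for rendering should not matter much, as long as every letter group of a word is rendered the same way.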
Calculate Bounding Box Positions
get_letter_positions
Combine the normalized letter widths with the word's bounding box to produce a bounding box for each letter, in both tight and buffered form.
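
A minimal sketch of this conversion is shown below. It is an illustration under stated assumptions, not the get_letter_positions code: the function name `split_word_box`, the `[x y w h]` box layout, a left-to-right script, a fixed horizontal buffer in pixels, and clipping to an image that spans 0 to `imgWidth` horizontally are all hypothetical choices.

```matlab
function [tight, buffered] = split_word_box(wordBox, relWidths, bufferPx, imgWidth)
% SPLIT_WORD_BOX  Hypothetical helper, not the AutoTag implementation.
%   wordBox   : [x y w h] of the word, x and y being the top-left corner
%   relWidths : relative letter widths, in left-to-right order
%   bufferPx  : buffer added to the left and right of each letter box
%   imgWidth  : image width, used to clip the buffered boxes
%   tight, buffered : n-by-4 matrices of [x y w h], one row per letter
    relWidths = relWidths(:) / sum(relWidths);   % column vector summing to 1
    n = numel(relWidths);

    % Left and right edges of every letter along the word box.
    edges = wordBox(1) + wordBox(3) * [0; cumsum(relWidths)];

    % Tight boxes share the word's vertical extent.
    yCol = repmat(wordBox(2), n, 1);
    hCol = repmat(wordBox(4), n, 1);
    tight = [edges(1:n), yCol, diff(edges), hCol];

    % Buffered boxes widen each letter and are clipped to the image.
    x1 = max(edges(1:n) - bufferPx, 0);
    x2 = min(edges(2:n+1) + bufferPx, imgWidth);
    buffered = [x1, yCol, x2 - x1, hCol];
end
```

For example, `split_word_box([10 5 100 20], [0.2 0.3 0.5], 2, 200)` splits a 100-pixel-wide word into tight boxes of width 20, 30 and 50, and buffered boxes of width 24, 34 and 54 that overlap their neighbours by 4 pixels.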