AutoTag

Code and information to implement AutoTag in Matlab

Based on theory and code by Dr Nishatul Majid

"Character spotting and autonomous tagging: offline handwriting recognition for Bangla, Korean and other alphabetic scripts," Nishatul Majid, Elisa H Barney Smith, December 2022, International Journal on Document Analysis and Recognition (IJDAR), Springer Berlin Heidelberg, Volume 25, Issue 4, Pages 245-263.

Designed to be used to develop training sets for Character Spotting:

"Segmentation-free Bangla offline handwriting recognition using sequential detection of characters and diacritics with a Faster R-CNN," Nishatul Majid, Elisa H Barney Smith, September 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE, Pages 228-233.

Summary:

For each image file

   Read page/word image and corresponding list of text words

   Find locations of each word in the page/word image (word segmentation)

   For each word

     Split into subWord components (letters in English and other European scripts, conjunct groups with a diacritic in Indic scripts, or individual Jamos in Korean)

     Estimate relative width of each component character (get_string_widths & strPixelWidthCalculator)

     Convert into bounding boxes (get_letter_positions)

     Add buffers (get_letter_positions)

     Determine component code index number

        add new codewords as necessary

   Save
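
The "Determine component code index number" step above can be sketched as a lookup table that grows whenever an unseen component appears. This is only an illustration: the variable names (codebook, components, codes) and the use of containers.Map are assumptions, not taken from the repository.

```matlab
% Hypothetical codebook mapping each component string to an integer code index.
codebook = containers.Map('KeyType', 'char', 'ValueType', 'double');

% Example subword components of one word (made-up data).
components = {'c', 'a', 't', 'a'};
codes = zeros(1, numel(components));

for k = 1:numel(components)
    key = components{k};
    if ~isKey(codebook, key)
        % Unseen component: add a new codeword with the next free index.
        codebook(key) = double(codebook.Count) + 1;
    end
    codes(k) = codebook(key);
end

disp(codes)
```

A repeated component ('a' here) reuses the index it was given the first time, so the codebook only grows when a genuinely new component is encountered.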

Calculate Letter Widths

strPixelWidthCalculator

strPixelWidthCalculator.m

Renders the character text into an image and measures the width of the rendered text in pixels.
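
One plausible way to measure the width, once the text has been drawn into a binary image, is to take the span of columns that contain ink. The sketch below uses a synthetic image as a stand-in for the rendered text; how strPixelWidthCalculator.m actually renders and measures is not shown here, and the names img, inkCols, and widthPx are assumptions.

```matlab
% Stand-in for a rendered glyph: a 20x30 binary image with ink in columns 8..19.
img = false(20, 30);
img(5:15, 8:19) = true;

% Columns that contain at least one ink pixel.
inkCols = find(any(img, 1));

% Width is the span from the first to the last inked column.
if isempty(inkCols)
    widthPx = 0;
else
    widthPx = inkCols(end) - inkCols(1) + 1;
end

disp(widthPx)
```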

get_string_widths.m

Pass in each letter group for a word. The measured widths are normalized to give the relative width of each group.
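
As a minimal sketch of the normalization, assuming it means dividing each measured width by the total so that the relative widths sum to 1 (the repository may normalize differently), with made-up pixel widths:

```matlab
% Hypothetical measured pixel widths for the letter groups of one word.
pixelWidths = [12 20 8];

% Assumed normalization: each width as a fraction of the total.
relWidths = pixelWidths / sum(pixelWidths);

disp(relWidths)
```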

Calculate Bounding Box Positions

get_letter_positions

get_letter_positions.m

Converts the relative widths, together with the bounding box information for the word, into letter bounding boxes, both tight and buffered.
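
The conversion can be illustrated by dividing the word's bounding box horizontally in proportion to the relative widths, then padding each box. Everything specific here is an assumption for illustration: the [x y width height] box format, the fixed 5-pixel buffer on every side, and the variable names. The actual get_letter_positions.m may differ, for example by clipping boxes to the image bounds.

```matlab
% Assumed inputs: word box as [x y width height], relative widths summing to 1.
wordBox   = [100 50 200 40];
relWidths = [0.3 0.5 0.2];
buffer    = 5;                      % assumed padding in pixels on every side

% Horizontal boundaries between consecutive letters.
edges = wordBox(1) + [0 cumsum(relWidths)] * wordBox(3);
n = numel(relWidths);

% Tight boxes: one row per letter, [x y width height].
tight = [edges(1:end-1)', repmat(wordBox(2), n, 1), ...
         diff(edges)',    repmat(wordBox(4), n, 1)];

% Buffered boxes: move the corner out and grow both dimensions.
buffered = tight + repmat([-buffer, -buffer, 2*buffer, 2*buffer], n, 1);

disp(tight)
disp(buffered)
```

The tight boxes tile the word box exactly, while neighboring buffered boxes overlap by twice the buffer.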