The data for this project consists mainly of handwritten characters. Rather than using purely digital images, we captured images of physical copies (handwritten or printed) so that the text would include noise and other imperfections. An example of one of these images is below. In addition, our module requires a standard data set to use as a reference for identifying characters. We took this reference from a standardized set available online, which simplifies processing and keeps the reference consistent.
Handwritten sample data, used for testing, written at a reasonable size for identification.
We noticed early in testing that processing closely spaced or dense text required significantly higher-resolution images, because segmenting with bounding boxes proved inaccurate on small, low-resolution characters. We therefore used larger, distinct lettering for our tests, which produced more definitive results that demonstrate proof of concept and accuracy. In addition, for simplicity and to reduce errors, we limited our processor to identifying capital letters.
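To illustrate why resolution matters for bounding-box segmentation, the sketch below uses a toy binary image in which 1 marks ink. It is not the project's actual segmentation code: the helper names `bounding_boxes` and `downsample` are hypothetical, and 4-connected flood fill is assumed as the way boxes are found. Two strokes separated by a one-pixel gap get separate boxes at full resolution but collapse into a single box once the image is downsampled.

```python
from collections import deque


def bounding_boxes(grid):
    """Return (top, left, bottom, right) for each 4-connected blob of 1s.

    Hypothetical helper for illustration; boxes are listed in row-major
    order of the first pixel found in each blob.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def downsample(grid, factor):
    """Shrink the image by `factor`; an output pixel is ink if any
    source pixel in its block is ink (a crude stand-in for low resolution)."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        row = []
        for c in range(0, cols, factor):
            block = [grid[y][x]
                     for y in range(r, min(r + factor, rows))
                     for x in range(c, min(c + factor, cols))]
            row.append(1 if any(block) else 0)
        out.append(row)
    return out


# Two vertical strokes (columns 0-1 and 3-4) with a one-pixel gap between them.
high_res = [[1, 1, 0, 1, 1, 0] for _ in range(4)]
low_res = downsample(high_res, 2)

print(bounding_boxes(high_res))  # [(0, 0, 3, 1), (0, 3, 3, 4)]
print(bounding_boxes(low_res))   # [(0, 0, 1, 2)]
```

In this toy case the gap between the strokes is narrower than one low-resolution pixel, so the two shapes merge. Larger, well-spaced lettering keeps the gaps wide enough to survive, which is consistent with the choice of test data described above.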