A System to Transcribe Documents in European Languages with Human Help

We worked on a system that transcribes words in an image containing Unicode characters into text. Such a system is useful for transcribing scanned documents in many European languages such as Spanish, German and French. European languages have existed for a long time and have rich traditions and cultures, with a corresponding literature. In addition, many books in various fields of science, technology and engineering have been, and continue to be, produced in these languages. Digitizing and transcribing these books will provide easy access to knowledge and information that can otherwise be accessed only through a physical copy of the book and is therefore limited to a privileged few.

The first step was to extract images of individual words from the scanned documents using adaptive thresholding, morphological segmentation and projection in Matlab. The remaining steps store the word images in a database and present them to human users for transcription. The website is available at

This research paper was published in the October 2011 edition of the journal of Image Processing, Computer Vision and Pattern Recognition (IPCV). This is joint work with Dr. Chityala, Minnesota Supercomputing Institute, University of Minnesota.
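As an illustration of the projection step named above, here is a minimal sketch in Python (the original work used Matlab; this is not the code from the paper). It assumes the text line has already been binarized so that ink pixels are 1 and background pixels are 0. The function names and the `min_gap` parameter are hypothetical and chosen only for this example.

```python
def column_profile(binary):
    """Count ink pixels (value 1) in each column of a binary image.

    `binary` is a list of rows, each row a list of 0/1 values.
    """
    return [sum(col) for col in zip(*binary)]


def split_by_gaps(profile, min_gap=2):
    """Split a projection profile into runs of ink.

    Returns (start, end) column ranges, with `end` exclusive. Two runs are
    treated as separate words only if at least `min_gap` consecutive empty
    columns lie between them (an assumed, tunable threshold).
    """
    segments = []
    start = None   # first column of the run being built, if any
    gap = 0        # number of consecutive empty columns seen inside the run
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The run ended just before the gap began.
                segments.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a run that reaches the right edge, dropping trailing blanks.
        segments.append((start, len(profile) - gap))
    return segments


line = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
]
words = split_by_gaps(column_profile(line), min_gap=2)
```

The single empty column inside the first run is shorter than `min_gap`, so it is treated as spacing between letters and not as a word boundary. Applying the same idea to row sums would separate text lines before words are split.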

This paper can be found at the following link

Method to extract names of geographical features from images of maps 

There are thousands of hand-drawn maps of Antarctica waiting to be digitized. These maps contain various first-order geographical features such as coasts, seas, plateaus and glaciers. Because these features are so extensive, their names are spelled out over a large area of the map along a contour. A human can perceive that all these letters belong to one word, but it is challenging to make a computer do the same. If the individual letters can be combined to form a word, the word can then be transcribed using an Optical Character Recognition (OCR) program so that text-based queries can be performed. In this paper, we apply the idea of perceptual grouping to simulated images containing text whose letters are separated by large distances. First, the centroid of each letter is determined. The centroids are then grouped based on their proximity to other centroids and on the direction between pairs of centroids. Once the letters are grouped, the images of the individual letters are passed to the OCR program after appropriate transformation. We applied the technique to simulated images containing a total of 12 words with 66 letters. The grouping process correctly identified 65 of the 66 letters and assembled them into words, an accuracy of about 98.5%.
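The grouping idea described above can be sketched as follows. This is a simplified illustration in Python under stated assumptions, not the method from the talk: it builds each word greedily by repeatedly taking the nearest unused centroid that is within `max_dist` and whose direction differs from the previous step by at most `max_turn_deg`. The function names, both parameters and the greedy strategy are assumptions made for this example.

```python
import math


def centroid(pixels):
    """Mean (x, y) position of a letter's pixel coordinates."""
    xs, ys = zip(*pixels)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def group_centroids(centroids, max_dist, max_turn_deg=45.0):
    """Greedily chain centroids into words by proximity and direction.

    `centroids` is a list of (x, y) tuples. Each chain starts at the
    leftmost unused centroid (an assumption that suits left-to-right text).
    """
    remaining = sorted(centroids)
    groups = []
    while remaining:
        chain = [remaining.pop(0)]
        heading = None  # direction of the previous step, in radians
        while True:
            last = chain[-1]
            best = None
            for c in remaining:
                d = math.dist(last, c)
                if d > max_dist:
                    continue  # too far away to be the next letter
                angle = math.atan2(c[1] - last[1], c[0] - last[0])
                if heading is not None:
                    # Smallest absolute difference between the two angles.
                    diff = angle - heading
                    turn = abs(math.atan2(math.sin(diff), math.cos(diff)))
                    if math.degrees(turn) > max_turn_deg:
                        continue  # direction changes too sharply
                if best is None or d < best[0]:
                    best = (d, c, angle)
            if best is None:
                break
            _, nxt, heading = best
            remaining.remove(nxt)
            chain.append(nxt)
        groups.append(chain)
    return groups


# Two simulated words: one along a gentle arc, one on a straight line.
arc = [(0, 0), (10, 2), (20, 5), (30, 9)]
flat = [(0, 100), (10, 100), (20, 100)]
groups = group_centroids(arc + flat, max_dist=15)
```

With these inputs the two words are far apart compared with `max_dist`, so proximity alone separates them. The direction check matters when words cross or pass close to each other, since it stops a chain from jumping to a nearby letter that lies off the current contour.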

This is the abstract of my talk at the American Mathematical Society National Meetings in Boston, 2012. It can also be found at