Handwritten Document Retrieval Strategies

Venu Govindaraju, Distinguished Professor of Computer Science and Engineering, The State University of New York, Buffalo.

With the continuous growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents in response to user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance (around 30% accuracy), which is a major hurdle to providing a robust search experience for handwritten documents. In this paper, we describe our recent research on information retrieval from the noisy text output by imperfect recognizers applied to handwritten document images. We describe three techniques, each exploring a different approach to the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and then uses the cleaned text for retrieval. The second method uses the uncorrected, raw OCR’ed text but modifies the standard vector space model to handle noisy text. The third method indexes the documents with robust image features instead of noisy OCR’ed text. We describe these approaches in detail and present their performance using standard IR evaluation metrics.
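
For readers unfamiliar with the standard vector space model mentioned in the abstract, the following is a minimal sketch of the unmodified baseline (TF-IDF weighting with cosine similarity). It is not one of the methods presented in the talk; the function names, the IDF smoothing, and the toy documents are assumptions made purely for illustration. It shows why recognizer errors hurt retrieval: a misrecognized term no longer matches the query term, so the document gets no credit for it.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed, commonly used variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by decreasing score."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical toy collection: document 1 is a noisy transcription of document 0.
docs = [
    ["patient", "fever", "cough"],
    ["pat1ent", "fevcr", "cough"],
    ["invoice", "payment", "due"],
]
ranking = rank(["patient", "fever"], docs)
```

In this toy example the clean transcription is ranked first, while the noisy copy of the same text scores zero because neither misrecognized term matches the query exactly. This is the kind of failure that the three approaches in the abstract aim to address.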


About Venu Govindaraju

Dr. Venu Govindaraju is a Distinguished Professor of Computer Science and Engineering at the University at Buffalo (SUNY Buffalo). He received his B-Tech (Honors) from the Indian Institute of Technology (IIT), Kharagpur, India in 1986, and his Ph.D. from UB in 1992. Dr. Govindaraju has authored more than 300 scientific papers, including 60 journal papers. His seminal work in handwriting recognition was at the core of the first handwritten address interpretation system used by the US Postal Service. He was also the prime technical lead responsible for technology transfer to Lockheed Martin and Siemens Corporation for deployment by the US Postal Service, Australia Post and UK Royal Mail. Dr. Govindaraju has been the Principal or Co-principal Investigator of projects funded by government and industry totaling about 55 million dollars. The Center for Unified Biometrics and Sensors (CUBS), which he founded in 2003, has since received about 10 million dollars of research funding covering several projects in biometrics, security, document recognition, and retrieval. Dr. Govindaraju has given over 80 invited talks and has supervised the dissertations of 18 doctoral students. He has served on the editorial boards of premier journals in his area and has chaired several technical conferences and workshops. He has won several awards for his scholarship, including the prestigious MIT Global Technovator Award (2004) and the HP Open Innovation Award (2008). He is a Fellow of the IEEE (Institute of Electrical and Electronics Engineers), a Fellow of the IAPR (International Association for Pattern Recognition) and a Fellow of the IETE (Institution of Electronics and Telecommunication Engineers).