SDI

Scanned Document Indexing (SDI) is a system that indexes scanned documents so that they can be digitally archived and made available to users. Most of these documents are old printed documents (such as books, newspapers or office documents) that have been scanned and stored in various image formats. Many of them are stored as PDF files, but their content is still an image representation of the original paper document. The aim of the project is to extract the text of each document using OCR algorithms, clean the extracted text of misrecognition errors and finally index it. Indexing the text content enables fast search over the collection of documents. The user can still consult the original scanned document (the copy of the original paper document) directly, so that mistakes made while extracting the text can be checked against the source.
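
The clean-then-index part of this pipeline can be illustrated with a minimal Python sketch. It is not the project's implementation: the OCR step is assumed to have already produced raw text, the two clean-up rules are only examples, and every name (`clean_ocr_text`, `tokenize`, `build_index`, `search`, the `scan_*.pdf` identifiers) is hypothetical.

```python
import re
from collections import defaultdict


def clean_ocr_text(raw: str) -> str:
    """Apply two illustrative clean-up rules to raw OCR output."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index mapping each token to the documents containing it.

    `pages` maps a document identifier (assumed here to be the scan's file
    name) to the raw text that an OCR engine produced for it.
    """
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, raw in pages.items():
        for token in tokenize(clean_ocr_text(raw)):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A search returns document identifiers rather than extracted text, so the user can open the matching scan itself and read the original page instead of relying on the OCR output.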