Document Image Binarization

Participants

Funding

Description

Documents are often poorly illuminated when scanned, or have yellowed with age, which gives the image an uneven background color. To convert the image into a text document, it is passed through an Optical Character Recognition (OCR) algorithm. Most OCR algorithms accept only black-and-white input images, with no intermediate gray levels, so the image must first be thresholded. The simplest thresholding algorithm is a global threshold, which applies one cutoff value to every pixel. It does not work well on images whose background varies.
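To illustrate why a single cutoff struggles with an uneven background, here is a minimal sketch in plain Python. The function name `global_threshold`, the cutoff values, and the six-pixel scan line are all hypothetical, chosen only for illustration; they are not taken from the project.

```python
def global_threshold(image, threshold=128):
    """Binarize with one cutoff: pixels below it become 0 (ink), the rest 255."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in image]


# Hypothetical scan line whose background fades from 220 (left) to 110 (right).
# Ink pixels sit at index 1 (gray level 120) and index 4 (gray level 30).
scan_line = [[220, 120, 220, 110, 30, 110]]

# A high cutoff keeps the left ink but turns the dim right background black.
high_cutoff = global_threshold(scan_line, threshold=128)

# A low cutoff cleans the right background but loses the left ink.
low_cutoff = global_threshold(scan_line, threshold=100)

print(high_cutoff)  # [[255, 0, 255, 0, 0, 0]]
print(low_cutoff)   # [[255, 255, 255, 255, 0, 255]]
```

In this example the left ink (120) is brighter than the right background (110), so no single cutoff can separate ink from background across the whole line.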

Adaptive thresholding algorithms, which compute a separate threshold for each pixel from its local neighborhood, can work around this, but they often leave the background with a peppered texture.

We are working to improve Niblack's algorithm, a common adaptive thresholding method, to overcome this problem. Preliminary results are promising.
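For reference, the standard Niblack method sets each pixel's threshold to T = m + k * s, where m and s are the mean and standard deviation of the gray levels in a window around the pixel. The sketch below shows that baseline only, not the improvement under development. The function name, the window radius, the value k = -0.2 (a commonly quoted choice for dark text on a light background), the handling of image borders, and the test rows are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def niblack_threshold(image, radius=1, k=-0.2):
    """Baseline Niblack binarization: a pixel is ink (0) if it is below
    local_mean + k * local_std, otherwise background (255)."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Collect the window around (x, y), clipped at the image borders
            # (border handling is an assumption of this sketch).
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_t = fmean(window) + k * pstdev(window)
            out_row.append(0 if image[y][x] < local_t else 255)
        result.append(out_row)
    return result


# Hypothetical scan line with a fading background; ink at index 1 and index 4.
uneven = [[220, 120, 220, 110, 30, 110]]
uneven_result = niblack_threshold(uneven)
print(uneven_result)  # [[255, 0, 255, 255, 0, 255]]

# Hypothetical text-free background with slight scanner noise.
noisy = [[200, 198, 201, 199, 202, 200]]
noisy_result = niblack_threshold(noisy)
print(noisy_result)
```

On the uneven line, the local threshold recovers both ink pixels that defeat a single global cutoff. On the noisy, text-free line, some background pixels still fall below their local threshold and come out black, which is the peppered texture described above.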

Publications