Document Degradation Modeling

Participants

  • Faculty: Dr. Elisa Barney Smith

  • Students: Margaret Norris, Johnny Hok Sum Yam, Craig McGillivary, Chris Hale, Darrin Reed, Roger Clements, Subramaniam Venkatraman, Xiaohui Qiu

Funding

  • This project was supported by the National Science Foundation under CAREER grant No. CCR-0238285.

  • It also received funding from NSF-EPSCoR and a BSU Faculty Research Grant.

Description

Dr. Barney Smith’s PhD dissertation focused on modeling the imaging process of a desktop document scanner and evaluating how that process produced degradations in bilevel document images. Much of her early work expanded on this topic. To improve the performance of Document Image Analysis (DIA), four major themes were investigated:

  • Model the nonlinear systems of printing, scanning, photocopying and FAXing, and multiple combinations of these, that produce degraded images, and develop methods to calibrate these models. From a calibrated model one can predict how a document will look after being subjected to these processes. This can be used to develop products that degrade text images less.

  • Statistically validate these models. This will give other researchers the confidence to use these models to create large training sets of synthetic characters, with which they can conduct controlled DIA and OCR experiments.
    Estimate the parameters to these models from a short character string to allow continuous calibration to account for spatially-variant systems.

  • OCR Training: Determine how these models and parameters can best be used to improve OCR accuracy by partitioning the training set based on modeled degradations and matching the appropriate partition to the test data at hand.

  • Filter: Improve the image quality by selecting a filter based on the degradations that are present and the process that caused that degradation.

MODEL DEVELOPMENT

Scanning – A model for the degradation caused by scanning has been developed. It consists primarily of a variable for the optics and a variable for the threshold level, plus additive noise. Each combination of parameters affects the resulting character differently.
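A minimal sketch of this kind of model, reduced to one dimension for brevity, is given below. It assumes a Gaussian point spread function for the optics and Gaussian additive noise. Those choices, the function names (gaussian_kernel, blur, scan) and the convention that 1 means ink are illustrative assumptions, not the calibrated model used in this project.

```python
import math
import random


def gaussian_kernel(width):
    """Sampled, normalized Gaussian PSF; `width` is its standard deviation in pixels."""
    radius = max(1, math.ceil(3 * width))
    taps = [math.exp(-0.5 * (k / width) ** 2) for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def blur(signal, kernel):
    """Convolve with edge replication so the output keeps the input length."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += weight * signal[j]
        out.append(acc)
    return out


def scan(ideal, psf_width, threshold, noise_sigma=0.0, seed=0):
    """Blur an ideal bilevel profile, add noise, then threshold back to bilevel."""
    rng = random.Random(seed)
    gray = blur(ideal, gaussian_kernel(psf_width))
    noisy = [g + rng.gauss(0.0, noise_sigma) for g in gray]
    return [1 if value >= threshold else 0 for value in noisy]


# A 10-pixel-wide stroke (1 = ink) on a 40-pixel line.
ideal_stroke = [0] * 15 + [1] * 10 + [0] * 15
widths = [sum(scan(ideal_stroke, psf_width=1.5, threshold=t)) for t in (0.2, 0.5, 0.8)]
print(widths)
```

Under these assumptions and with the noise switched off, the stroke becomes narrower as the threshold rises, which is the kind of parameter-dependent change in the character that the model is meant to capture.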

Along with the modeling, we have developed a few methods for estimating the model parameters from bilevel images. This will enable us to answer the question “Which degradation is accurate for a given scanner?” We are working on how to use this to improve scanner development or recognition of the printed characters. While the degradations from scanners represent only one facet of document image degradation sources, they are the most accessible. Once estimation of scanning parameters can be done accurately and efficiently, other types of degradations can be accessed.


Noise Effects – Noise clearly affects images; the focus of this project is quantifying that effect. Additive noise affects a bilevel image differently depending on the nature of the gray level image to which the noise is added. If we assume the gray level image is formed by blurring a high contrast image, as in the scanning process, and that the image is thresholded after the noise is added, then the blur width and the binarization threshold have a large effect on how many pixels are affected and on how far from an edge pixels are affected. This in turn can affect the ability to fit a line or shape to an edge. We have defined a metric called Noise Spread to capture this effect.
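The dependence on blur width and threshold can be illustrated with a short calculation. The sketch below assumes a straight edge blurred by a Gaussian point spread function and zero-mean Gaussian noise, and computes the probability that noise pushes a pixel to the other side of the threshold. It is not the Noise Spread metric itself, whose definition is given in the publications; the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def flip_probability(distance, blur_width, threshold, noise_sigma):
    """Chance that additive noise changes the binarized value of a pixel.

    `distance` is the signed distance (pixels) from the ideal edge. The blurred
    edge is assumed to be gray(x) = Phi(x / blur_width), values in [0, 1].
    """
    gray = normal_cdf(distance / blur_width)
    margin = abs(gray - threshold)  # how far the noise-free value sits from the threshold
    return normal_cdf(-margin / noise_sigma)


def pixels_at_risk(blur_width, threshold, noise_sigma, min_prob=0.01, reach=20):
    """Integer offsets from the edge whose flip probability is at least min_prob."""
    return [d for d in range(-reach, reach + 1)
            if flip_probability(d, blur_width, threshold, noise_sigma) >= min_prob]


narrow = pixels_at_risk(blur_width=1.0, threshold=0.5, noise_sigma=0.1)
wide = pixels_at_risk(blur_width=3.0, threshold=0.5, noise_sigma=0.1)
print(len(narrow), len(wide))
```

With the same noise level, the wider blur leaves more pixels close to the threshold, so more pixels farther from the edge are at risk of flipping.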

Integration of Printer & Scanner Models (Photocopying and FAXing)

A photocopier is basically a scanner coupled with a printer, and a FAX machine is basically a low resolution copy machine. When the printer and scanner models above are merged, models of copiers and FAX machines can be developed.

Printing and scanning are the building block processes for document generation. OCR analysis is done on digitized images, which requires that all documents be scanned. Thus the scanning degradations are the easiest to isolate and were the first to be examined. Prior work of Dr. Barney Smith has generated new understanding of the relationship between scanner system parameters and document image degradations (stroke width and corner erosion), as well as a collection of tools for scanning models. Printing has been studied in the field of halftoning to decide what to print to get a desired grey level. These two fields will be combined for use on the bilevel images common in Document Image Analysis.

The printer model must be combined with the existing model of scanning. To simplify this combination of models, we want to see whether the nonlinear printing model can be approximated by a linear two-dimensional convolution. If so, then the kernel for the convolution in the printing model can be combined with the kernel for the convolution in the scanning model to make a single print/scan kernel. Doing this would enable all the methods that exist for the scanning model to be maintained. This is one possible route to the next goal of developing defect models that incorporate both printing and scanning sub-system models.
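The reason a linear approximation would help is that convolution is associative: blurring by a printer kernel and then by a scanner kernel equals blurring once by the convolution of the two kernels. The one-dimensional sketch below illustrates this; both kernels hold made-up values chosen only for the example, and whether printing can actually be approximated this way is the open question stated above.

```python
def convolve(a, b):
    """Full discrete convolution of two 1-D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


# Hypothetical kernels, not measured from any device.
printer_kernel = [0.1, 0.8, 0.1]
scanner_kernel = [0.05, 0.25, 0.4, 0.25, 0.05]

# Single kernel representing printing followed by scanning.
print_scan_kernel = convolve(printer_kernel, scanner_kernel)

page = [0, 0, 1, 1, 1, 0, 0]  # ideal bilevel profile, 1 = ink
two_step = convolve(convolve(page, printer_kernel), scanner_kernel)
one_step = convolve(page, print_scan_kernel)

max_difference = max(abs(p - q) for p, q in zip(two_step, one_step))
print(max_difference)
```

The two results agree to within floating point rounding, so any estimation method written for a single convolution kernel would apply unchanged to the merged kernel.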

Methods have been developed to calibrate the scanner defect model without extensive equipment, predominantly using information in text images as opposed to specialized test charts. These need to be expanded to the combined print/scan model.

Models are only useful to the scientific community when they are validated. A method of validating defect models was proposed by Kanungo. This method has been used by Dr. Barney Smith in research done with Dr. Qiu. Code to make this flexible and to try some other experimental possibilities is currently being developed by a pair of undergraduate research students. Using this procedure to validate each of these models is the final part of this work.

Human Comparative Studies: We have proposed that the amount of Edge Spread in a document is a good metric for how degraded the document appears. This is the foundation of the “Training Methods” project described under OCR TRAINING PROCEDURES. We also want to see whether this or other parameters of our model are correlated with how humans rank image quality, or the lack thereof.

VALIDATE THE MODEL

Statistical Validation – Statistical validation is a very important component of developing a model. Code has been written to make a flexible platform through which these validation experiments can be run. This includes configuration to run under a grid computing framework. Validation will consider the choice of point spread function (PSF) in the model, and we hope to compare our model with other degradation models. Source data is currently being collected to enable us to run these experiments.
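As a generic illustration of non-parametric validation, the sketch below runs a two-sample permutation test on one scalar feature measured from real and synthetic characters. It is not the validation procedure used in this project; the feature (a stroke width), the sample values and the function name are assumptions made for the example.

```python
import random


def permutation_p_value(real, synthetic, trials=2000, seed=0):
    """Two-sample permutation test on the absolute difference of means.

    A small p-value suggests the synthetic samples are distinguishable from
    the real ones on this feature, i.e. the model does not reproduce it.
    """
    rng = random.Random(seed)

    def mean_gap(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_gap(real, synthetic)
    pooled = list(real) + list(synthetic)
    split = len(real)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # relabel samples at random
        if mean_gap(pooled[:split], pooled[split:]) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Made-up stroke widths (pixels) for illustration only.
real_widths = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1]
poor_model_widths = [12.0, 12.1, 11.9, 12.2, 12.0, 11.8, 12.3, 12.1]

print(permutation_p_value(real_widths, poor_model_widths))
print(permutation_p_value(real_widths, real_widths))
```

A test of this kind makes no assumption about the distribution of the feature, which is why permutation-style tests are attractive for comparing populations of degraded characters.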

MODEL PARAMETER ESTIMATION

Estimation of parameters to the scanner defect model can be done with features available in common textual images. We have selected several characters that are suitable for this estimation, and have estimated how many of them are needed to produce a ‘good’ estimate. A method to determine when the measurements are changing enough to indicate that the model has changed either page to page or within a given page is still needed.

OCR TRAINING PROCEDURES

Training Methods – In a paper by Barney Smith and Qiu, regions of the degradation space were found where characters are statistically similar by multiple metrics. Training sets are currently being prepared in these regions and evaluation of the effect on OCR will follow.

The most prevalent method of improving individual character recognition is to train the classifier with as large a training set as possible. Recognition accuracies can also be increased by matching the training set to the document. This has been done by extracting templates from the document in one step and then using them for recognition in a second pass through the document. We propose to use the model calibration developed previously to combine these two methods, providing a training set that is both large and matched to the OCR process.

There are several benchmark datasets available for testing OCR systems, and OCR companies have their own large datasets. When a page can be analyzed to characterize the degradation’s relationship to the model, the database can be subdivided into large sets of matched characters.

For good recognition accuracy, the training set must be both large and closely matched to the test set. The goal of this work is to develop a framework that also matches the degradation level of the document to a large training set, by developing methods to partition a large training set and using the model calibration to select the appropriate partition.

We will determine how to divide the model and parameter spaces to capture the difference in the image degradation. At the same time we want to limit the number of partitions to keep management simple and to maintain the ability to generalize. The partitions should be set such that an error in model calibration will not often point to a different template set. We will determine a method to partition the degradation space for classifier training as one focus of this proposed research. Work by Dr. Barney Smith showed that characters will have a similar appearance as quantified by the Hamming distance when degraded with combinations of degradation parameters yielding the same edge displacement degradation feature so long as the difference in PSF width was not very large. It is expected that subdividing the space in regions of common edge displacement will work better than using a Cartesian division. Other metrics of similarity will be examined also.
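The difference between the two ways of dividing the space can be sketched as follows. The example assumes a Gaussian PSF, for which a blurred straight edge crosses the threshold at psf_width times the inverse normal CDF of the threshold. The bin sizes, the sign convention and the function names are hypothetical choices for illustration, not the partitioning developed in this research.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def edge_displacement(psf_width, threshold):
    """Shift of a thresholded edge, assuming gray(x) = Phi(x / psf_width).

    `threshold` must lie strictly between 0 and 1.
    """
    return psf_width * STANDARD_NORMAL.inv_cdf(threshold)


def displacement_partition(psf_width, threshold, bin_size=0.25):
    """Partition index based on common edge displacement."""
    return round(edge_displacement(psf_width, threshold) / bin_size)


def cartesian_partition(psf_width, threshold, width_step=0.5, threshold_step=0.1):
    """Partition index from an axis-aligned grid over (width, threshold)."""
    return (int(psf_width // width_step), int(threshold // threshold_step))


# Two parameter pairs built to share the same edge displacement of -0.5 pixels.
pair_a = (1.0, STANDARD_NORMAL.cdf(-0.5))
pair_b = (2.0, STANDARD_NORMAL.cdf(-0.25))

print(displacement_partition(*pair_a), displacement_partition(*pair_b))
print(cartesian_partition(*pair_a), cartesian_partition(*pair_b))
```

The two pairs land in the same displacement-based partition but in different cells of the Cartesian grid, which is the situation where a displacement-based division is expected to group similar-looking characters together.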

Ho & Baird compared how a classifier trained on a single font (25 phases, 125 (5x5x5) degradations, 94 char classes) degraded over the whole degradation space responds to samples at each point in the degradation space. This showed under which model parameters characters are difficult to recognize when the classifier is trained on a global training set. We will train a family of classifiers, each on characters degraded with parameters from a different subset of the degradation space. We will then evaluate the classification results for each classifier over the whole space.

We will compare recognition accuracies both with and without this partitioning method. This will be done with both a spatially invariant model and using the adaptive parameter variation. With OCR accuracy on highly degraded document images at 92%, there is need for improvement. A 1% improvement in recognition accuracy on a typical page of 2500 characters will remove 25 errors.

SELECTING DOCUMENT IMAGE ENHANCEMENT FILTERS BASED ON DEGRADATION MODEL PARAMETERS

Images are degraded by a number of different mechanisms. If those mechanisms are primarily dictated by a common degradation model that contains blurring (convolution), sampling, additive noise and thresholding, and if the parameters to that degradation model are known, then the choice of appropriate filter should be easier to determine.

Past efforts to choose restoration filters have looked at characteristics of the image and at how the filters have improved recognition. They have not considered the mathematical source of the degradations.

The restored image is not likely to be the original image, due to the non-linear nature of the degradation. The criterion for determining the best output image, and thus the best filter, could be a comparison of the final output with the input, but other metrics such as recognition accuracy could be used.

PUBLICATIONS

  • E. H. Barney Smith and X. Qiu, “Statistical Image Differences, Degradation Features and Character Distance Metrics,” International Journal of Document Analysis and Recognition, Springer Verlag, Vol. 6, No. 3, 2004, pp. 146-153 (special issue, invited paper).

  • E. H. Barney Smith, “Characterization of Image Degradation Caused by Scanning,” Pattern Recognition Letters, Vol. 19, No. 13, November 1998, pp. 1191-1197.

  • Craig McGillivary, Chris Hale, and Elisa H. Barney Smith, “Noise Effects in Bilevel Document Images,” 3rd Workshop on Analytics for Noisy Unstructured Text Data (AND-09), Barcelona, Spain, 23-24 July 2009, pp. 17-24.

  • Elisa H. Barney Smith, “Modeling Image Degradations for Improving OCR”, (Invited paper), Proc. 16th European Signal Processing Conference, Lausanne, Switzerland, 25-29 August 2008.

  • D. K. Reed and E. H. Barney Smith, “Correlating degradation models and image quality metrics,” Proc. SPIE Electronic Imaging, Document Recognition and Retrieval XV, Vol. 6815, San Jose, CA, January 2008, paper #681508.

  • Chris Hale and Elisa H. Barney Smith, “Human Image Preference and Document Degradation Models,” International Conference on Document Analysis and Recognition 2007, Curitiba, Brazil, September 2007, pp. 250-254.

  • E. H. Barney Smith, “PSF estimation by gradient descent fit to the ESF,” Proc. SPIE Electronic Imaging, Image Quality and System Performance III, Vol. 6059, San Jose, CA, January 2006, paper #605914.

  • E. H. Barney Smith and T. Andersen, “Partitioning of the degradation space for OCR training,” Proc. SPIE Electronic Imaging, Document Recognition and Retrieval XIII, Vol. 6067, San Jose, CA, January 2006, paper #606705.

  • E. H. Barney Smith and T. Andersen, “Text Degradations and OCR Training,” International Conference on Document Analysis and Recognition 2005, Seoul, Korea, August 2005, pp. 834-838.

  • H.S. Yam and E. H. Barney Smith, “Estimating Degradation Model Parameters from Character Images,” International Conference on Document Analysis and Recognition 2003, Edinburgh, Scotland, 3-6 August 2003, pp. 710-714.

  • R. Clements and E. H. Barney Smith, “Speedup of Optical Scanner Characterization Subsystem,” Proc. SPIE Electronic Imaging, Document Recognition and Retrieval X, Vol. 5010, Santa Clara, CA, January 2003, pp. 94-102.

  • E. H. Barney Smith and X. Qiu, “Relating statistical image differences and degradation features,” Proc. 5th International Workshop, Document Analysis Symposium 2002, Princeton, NJ, Springer Verlag LNCS 2423, August 2002, pp. 1-12.

  • E. H. Barney Smith, “Uniqueness of bilevel image degradations,” Proc. SPIE Photonics West, Document Recognition and Retrieval IX, Vol. 4670, San Jose, CA, January 2002, pp. 174-180.

  • E. H. Barney Smith, “Scanner Parameter Estimation Using Bilevel Scans of Star Charts,” International Conference on Document Analysis and Recognition 2001, Seattle, WA, September 2001, pp. 1164-1168.

  • E. H. Barney Smith, “Bilevel Image Degradations: Effects and Estimation,” 2001 Symposium on Document Image Understanding Technology, Columbia, MD, 23-25 April 2001, pp. 49-55.

  • E. H. Barney Smith, “Estimating Scanning Characteristics from Corners in Bilevel Images,” Proc. SPIE Document Recognition and Retrieval VIII, San Jose, CA, 21-26 January 2001, pp. 176-183.

  • Craig McGillivary, “Quantifying Noise Effects in Bilevel Document Images,” Masters thesis, Electrical Engineering, Boise State University, December 2007.

  • Subramaniam Venkatraman, “Degradation Specific OCR,” Masters thesis, Electrical Engineering, Boise State University, December 2010.

  • Hok Sum (Johnny) Yam, “Estimating Degradation Model Parameters from Character Images,” Masters thesis, Electrical Engineering, Boise State University, December 2004.

  • Roger Clements, “Speedup of Optical Scanner Characterization System,” Masters thesis, Computer Engineering, Boise State University, May 2003.

Printer Model

Printing – It is usually assumed that characters are printed with the smooth boundaries that people believe they see. In reality, printing processes such as laser and inkjet printing spread toner or ink around the image boundaries.

A model was developed by Yi at the University of Idaho for the amount of electrostatic charge on the charge roller of a laser printer. A method to convert this charge density to a measure of the average toner coverage produced on a piece of paper has been completed as an MS thesis in this lab. The coverage is a function of the number of toner pieces available to be distributed, the size of the toner pieces and the laser trace pattern. Simulations visually match magnified samples and averages of simulated toner placement. Current work is confirming two things: that averages of printed samples match our expected average coverage, by comparing model outputs to test samples printed on a printer with special control, and that populations of individual printed samples are statistically similar to populations of samples generated by our model. The coverage will then be converted to expected reflectance, and a qualitative measure of how the printing degradations affect images representing characters will follow.
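To make the dependence on particle count, particle size and trace pattern concrete, the following is a toy Monte Carlo sketch and not the charge-based model described above. It assumes circular toner particles whose horizontal positions are normally distributed around vertical laser traces; the function name, parameters and grid size are all hypothetical.

```python
import math
import random


def toner_coverage(n_particles, radius, trace_centers, spread,
                   width=60, height=40, seed=0):
    """Fraction of each pixel column covered by randomly placed toner disks."""
    rng = random.Random(seed)
    covered = [[False] * width for _ in range(height)]
    for _ in range(n_particles):
        # Particle centre: near one of the laser traces, anywhere along it.
        cx = rng.choice(trace_centers) + rng.gauss(0.0, spread)
        cy = rng.uniform(0.0, height)
        for y in range(max(0, math.floor(cy - radius)),
                       min(height, math.ceil(cy + radius) + 1)):
            for x in range(max(0, math.floor(cx - radius)),
                           min(width, math.ceil(cx + radius) + 1)):
                # A pixel counts as covered when its centre lies inside the disk.
                if (x + 0.5 - cx) ** 2 + (y + 0.5 - cy) ** 2 <= radius ** 2:
                    covered[y][x] = True
    return [sum(row[x] for row in covered) / height for x in range(width)]


sparse = toner_coverage(50, radius=1.5, trace_centers=[30], spread=2.0)
dense = toner_coverage(800, radius=1.5, trace_centers=[30], spread=2.0)
print(round(sum(sparse), 2), round(sum(dense), 2))
```

Each run gives one sample image. Averaging the column profiles over many seeds would approximate an expected coverage, which is the quantity a calibrated model can predict even though individual samples vary.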

Expanding the printer degradation model – A printer degradation model has been developed, but it is not as mature as the scanner model, and open questions remain.

Including fuser effects in printer model – It has been observed that, while the printer model does a very good job of representing the images on paper at a microscopic level, there is a directional effect on the image during the fusing stage, after the toner has been attracted to the paper. Incorporating this into the model is an open problem.

Calibration of printer model – The printer model is a much more stochastic model than the scanner model. In the scanner model, the additive noise is stochastic, but the general form of the resulting image is not. For the printer, the toner that produces the image is adhered in a stochastic manner. The expected coverage of the paper by toner can be predicted, but not the sample image. Still, the calibration can be done by looking at edge raggedness, fill density and the effective width of a stroke of a known number and spacing of laser traces. Taking these measurements and correlating them to the output of the printer model parameters is an open project.
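A rough sketch of how these three measurements might be taken from a bilevel image of a vertical stroke follows. The definitions are simplified assumptions: raggedness is taken here as the standard deviation of the left edge position, which is not the ISO 13660 definition, and the function name and sample image are hypothetical.

```python
import statistics


def stroke_measurements(image):
    """Measure a vertical stroke in a bilevel image (rows of 0/1, 1 = toner).

    Assumes at least one row contains toner. Rows without toner are skipped.
    """
    left_edges, widths, fills = [], [], []
    for row in image:
        ink = [x for x, value in enumerate(row) if value]
        if not ink:
            continue
        left, right = ink[0], ink[-1]
        extent = right - left + 1
        left_edges.append(left)
        widths.append(extent)
        fills.append(len(ink) / extent)  # fraction of the stroke extent with toner
    return {
        "raggedness": statistics.pstdev(left_edges),
        "stroke_width": statistics.mean(widths),
        "fill_density": statistics.mean(fills),
    }


sample = [
    [0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
]
print(stroke_measurements(sample))
```

Measurements like these could be collected from printed test strokes and compared with the same measurements on model output, which is one way to approach the correlation described above.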

The parameters affecting the printer model currently include reflectance of the paper and ink, trace of the laser, spread of the toner, and size and quantity of toner particles. This study will also show what types of image degradations each parameter affects. The information about the relationship between the parameters and the degradations could benefit printer design.

A paper document must be printed to be created, and printing is also the output process of photocopying and FAXing. This proposed graduate project comprises the development of a calibration method for the printing process suitable for use in Document Image Analysis and the integration of the printing and scanning models.

For this project, only the electrophotographic printing process will be considered. The toner is applied to the paper in quantities related to the charge on the photoconductor. The charge is related to the laser intensity. A master's student graduating in May 2004, Margaret Norris, has developed a model of how toner is dispersed on paper that includes prior work on how laser intensity is related to source image shape. The exact amount of toner applied to the paper and the resulting absorptance level (darkness grey level) are probabilistic measures, so a probabilistic model has been developed. This model needs to be expanded to a broader class of input pictures. A method needs to be developed to calibrate this model based on samples of printed characters.

VALIDATE THE MODEL

Validation of printer model – The approach of comparing this model to other models through non-parametric statistical testing could be applied to validating the printer model. The validation can be done on the pixel level or on a more macro level with multi-pixel strokes and other shapes.

Comparing Models: Features that often appear in degraded documents have been identified and are used by many researchers. These include the number of touching characters, small speckles, broken characters, etc. in a document. Several metrics to measure these factors exist. We are examining how these factors relate to the parameters in our degradation model.

PUBLICATIONS

  • Elisa H. Barney Smith, “Chapter 1.2 – Document Creation, Image Acquisition and Document Quality,” in Handbook of Document Image Processing and Recognition, Eds. D. Doermann and K. Tombre, Springer-Verlag, ISBN 978-0-85729-858-4, June 2014.

  • Elisa H. Barney Smith, “Relating electrophotographic printing model and ISO13660 standard attributes,” Proc. Image Quality and System Performance VII, Vol. 7529, San Jose, CA, January 2010.

  • Elisa H. Barney Smith, Eric Maggard, Scott Line, Mark Shaw, “Quantifying Print Quality for Practice,” NIP & Digital Fabrication Conference, November 2015, pp. 157-162.

  • Edul N. Dalal, Elisa H. Barney Smith, Frans Gaykema, Allan Haley, Kerry Kirk, Don Kozak, Mark Robb, Tim Qian, Ming-Kai Tse, “INCITS W1.1 standards for perceptual evaluation of Text and Line Quality,” Proc. SPIE Electronic Imaging, Image Quality and System Performance VI, San Jose, CA, January 2009, paper #724203.

  • T. Bouk, E. N. Dalal, K. D. Donohue, S. Farnand, F. Gaykema, D. Gusev, A. Haley, P. L. Jeran, D. Kozak, W. C. Kress, O. Martinez, D. Mashtare, A. McCarthy, Y. S. Ng, D. R. Rasmussen, M. Robb, H. Shin, M. Q. Slickers, E. H. Barney Smith, M-K. Tse, D. Williams, E. Zeise, S. Zoltner, “Recent progress in the development of INCITS W1.1, Appearance-based image quality standards for printers,” Proc. SPIE Electronic Imaging, Image Quality and System Performance IV, Vol. 6494, San Jose, CA, January 2007, paper #64940K.

  • Margaret Norris, “Modeling of Toner Coverage in Laser Printers,” Masters thesis, Electrical Engineering, Boise State University, May 2004.

  • M. Norris and E. H. Barney Smith, “Printer Modeling for Document Imaging,” Proc. 2004 International Conference on Imaging Science, Systems, and Technology (CISST’04), Las Vegas, Nevada, USA, June 21-24, 2004, pp. 14-20.
