Above: A short description of OCR and examples of current uses.
What is it?
Optical Character Recognition (OCR) is the computerized conversion of text in a photographic image into data that a computer can recognize and render as typed text (Alpert-Abrams, 2016). OCR, which often relies on machine learning, can improve the accuracy of datasets by adding data extracted from better-quality images of the intended subject. Two examples are Google Maps, which uses Street View photos to link landmarks more accurately with their geographical locations, and the United States Postal Service, which uses OCR to identify and route pieces of mail correctly.
Libraries and cultural heritage institutions can rely on OCR technology to digitize collections from past and present.
Example of Workflow
One example of an OCR digitization workflow consists of four steps: pre-processing, layout analysis, recognition, and correction (Blanke et al., 2012). Pre-processing is the automated removal of noise and other imperfections from the image of the object to improve its visual quality before data is extracted. Layout analysis is the automatic detection of the object’s features, design, and alignment. If the object is not aligned properly, for example when a scanned document is fed in at an angle, the OCR software will not work properly and may produce an illegible transcription of the text. Recognition is the automated retrieval of textual and visual information from the item for transcription. The final step, correction, is not automated: a human currently has to check the item, the transcription, and the accuracy of the information (Blanke et al., 2012).
Early OCR software and algorithms were created by commercial corporations to digitize large volumes of corporate documents efficiently. When these early algorithms, designed for a few specific commercial document formats, are applied to historical documents or to resources with complex layouts, they produce many errors (Milligan, 2013).
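The pre-processing step described above can be illustrated with a small sketch. It is not the method used by Blanke et al. or by any particular OCR package; it assumes a page represented as a grid of grayscale values (0 to 255), and the function names `binarize` and `despeckle`, as well as the fixed threshold of 128, are hypothetical choices made for illustration only.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = ink (dark pixel), 0 = background


def binarize(gray: List[List[int]], threshold: int = 128) -> Bitmap:
    """Map grayscale values (0-255) to ink/background with a fixed threshold.

    The threshold value is an assumption; real scans usually need tuning.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(bitmap: Bitmap) -> Bitmap:
    """Clear ink pixels that have no ink among their eight neighbours."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]  # copy so the input is not modified
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] == 0:
                continue
            has_ink_neighbour = any(
                bitmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 0  # isolated speck, treated as noise
    return cleaned


# A tiny "scan": a two-pixel stroke plus one isolated dark speck.
scan = [
    [255, 255, 255, 255, 255],
    [255,   0,   0, 255, 255],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255,  10],
]
clean = despeckle(binarize(scan))
```

In this toy input the two adjacent dark pixels survive because each has an ink neighbour, while the lone dark pixel in the corner is removed. Production tools typically use more robust filters, and they also correct skew before layout analysis.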
How can OCR benefit libraries?
OCR technology has great potential to improve the collections of libraries and other knowledge institutions, but the inaccuracy of finished transcriptions, coupled with the need for human attention to correct them, is a major downside. Another issue arises when digitized collections are used for research. If every document in a collection is digitized using OCR, the system cannot distinguish the varying levels of importance among resources. When a researcher searches the library catalog for specific keywords, every document containing those keywords appears in the results, not just academic resources (Lee, 2014). This is another scenario in which a human is needed to supervise the OCR system and to check its output for correctness.
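One way to put a number on transcription inaccuracy is the character error rate: the edit distance between the OCR output and a human-checked transcription, divided by the length of the checked transcription. The sketch below is an illustration only and is not drawn from the sources cited here; the function names and the sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the human-checked reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical example: the OCR engine misread the letter "l" as the digit "1".
rate = character_error_rate("library", "1ibrary")
```

Here one of seven characters is wrong, so the rate is 1/7, or roughly 14 percent. A figure like this can help staff decide which items most need human correction, although it requires a checked transcription to compare against.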
Readily Available Machine Learning Software
As Kim (2021) notes, major tech companies offer several readily available machine learning programs that use OCR, which libraries and archives can purchase by subscription to better catalog, protect, and organize their archival materials or collection items. Ready-to-use machine learning software is pre-trained on data from the technology company or creator, which enables it to perform tasks such as facial, handwriting, text, or image recognition, depending on the software. Two pre-trained, ready-to-use programs that Kim mentions are Amazon Rekognition and Google Cloud Vision AI (Kim, 2021). Kraken, by contrast, is an open-source, neural-network-powered OCR solution that can handle complex layouts, such as newspaper articles (Miller et al., 2018). Kraken was developed by the OpenITI team at Leipzig University in Leipzig, Germany, and can integrate with other systems such as the Pybossa microtask/crowdsourcing platform and the Nidaba OCR pipeline (Miller et al., 2018). The benefit of OCR software that integrates with other programs is ease of use and a greater ability to accommodate complex projects in different formats.
References
Alpert-Abrams, H. (2016). Machine reading the Primeros Libros. Digital Humanities Quarterly, 10(4).
Blanke, T., Bryant, M., & Hedges, M. (2012). Open-source optical character recognition for historical research. Journal of Documentation, 68(5), 659-683. doi:10.1108/00220411211256021
Kim, B. (2021). Machine learning for libraries and archives. Online Searcher, 45(1), 39-41.
Lee, M. S. (2014). Falsifiability, confirmation bias, and textual promiscuity. J19, 2(1), 162-171. doi:10.1353/jnc.2014.0014
Miller, M. T., Romanov, M. G., & Savant, S. B. (2018). Digitizing the textual heritage of the premodern Islamicate world: Principles and plans. International Journal of Middle East Studies, 50(1), 103-109. doi:10.1017/S0020743817000964
Milligan, I. (2013). Illusionary order: Online databases, optical character recognition, and Canadian history, 1997–2010. The Canadian Historical Review, 94(4), 540-569. doi:10.3138/chr.694