Historical Data Recognition

What is it?

Historical data recognition is perhaps one of the most recognizable uses of machine learning in the library. It helps librarians discover patterns among previously unlinked resources and allows for deeper understanding, analysis, and interpretation of the historical documents within a collection. Machine learning and A.I. also free librarians to spend more time on higher-value tasks, rather than on organizing huge collections of information or items, work that could take the average person years to complete by hand. Techniques such as optical character recognition (OCR), predictive analysis, handwriting recognition, and metadata recognition can each be used for different types of historical data recognition, depending on the resource.

Several authors have re-imagined OCR errors in their research (Cordell, 2020; Shoemaker, 2019; Smith et al., 2015). By employing the practice of hermeneutic certitude, they point out where the software can improve, how error-laden text can still be interpreted, and how the user experience with the software can change for the better. David Smith and Ryan Cordell’s combined research hovers at the forefront of philosophical document review. In their 2013 paper, Infectious texts: Modeling text reuse in nineteenth-century newspapers, text-reuse detection algorithms were applied to OCR-transcribed pages to identify reprinted articles in nineteenth-century newspapers. The aim was to discover which articles were repeated, which ideas were being spread, and which areas of the population those ideas were reaching at the time (Smith et al., 2013).

In this example, OCR technology is used to understand the nature of print publication in nineteenth-century American newspapers, a previously unexplored and misunderstood area of study (Smith et al., 2013). Further research by David Smith and Ryan Cordell, Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers, again uses OCR-based methods to find reprinted newspaper articles that influenced public opinion, this time during the Antebellum period in America. It finds that poems, book excerpts, and articles were reprinted frequently in the same publications, often originating from overseas sources (Smith et al., 2015). Smith and Cordell are pursuing this area further by collaborating with researchers internationally, and across multiple languages, to understand how nineteenth-century Americans exchanged texts internationally (Smith et al., 2015). Cordell and Mullen explain in their 2017 article, “Fugitive Verses”: The Circulation of Poems in Nineteenth-Century American Newspapers, that poetry published in American newspapers in the early nineteenth century is almost always authorless, and is thus named “fugitive verses”. Even so, the circulation, popularity, content, and republication of these poems provide an important view into the social context of the time (Cordell & Mullen, 2017).
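The idea of finding reprinted texts in OCR output can be illustrated with a small sketch. This is not the method used by Smith and colleagues; it is a simplified stand-in that compares documents by their shared word n-grams ("shingles") and scores each pair with Jaccard similarity. The function names, the sample texts, and the values of `n` and `threshold` are all hypothetical and chosen only for illustration. Because an OCR error corrupts only the n-grams that contain it, two copies of the same article can still share many n-grams even when the transcription is imperfect.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) found in a text."""
    # Lowercase and keep alphabetic runs only, a crude normalization
    # that discards punctuation and digits.
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_reprint_candidates(docs, n=5, threshold=0.2):
    """Return (id_a, id_b, score) for every document pair at or above the threshold.

    `docs` maps a document id to its (possibly error-laden) OCR text.
    All pairs are compared, which is fine for a handful of texts but
    would not scale to a large newspaper archive.
    """
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            score = jaccard(sets[a], sets[b])
            if score >= threshold:
                pairs.append((a, b, score))
    # Highest-scoring pairs first.
    return sorted(pairs, key=lambda p: -p[2])


if __name__ == "__main__":
    # Hypothetical OCR snippets; "ovcr" imitates a misread character.
    sample = {
        "paper_a": "The quick brown fox jumps over the lazy dog near the river bank",
        "paper_b": "the quick brown fox jumps ovcr the lazy dog near the river bank today",
        "paper_c": "Prices of wheat and corn rose sharply in the western markets this week",
    }
    for a, b, score in find_reprint_candidates(sample, n=3):
        print(a, b, round(score, 3))
```

In this sketch the two "fox" snippets are flagged as a likely reprint despite the simulated OCR error, while the unrelated market report is not. A project working at archive scale would need an index over n-grams rather than all-pairs comparison, and usually a finer alignment step to locate the exact reprinted passage.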

Case Study: Historical Watermark Recognition

Above: Researchers from Ecole des Ponts ParisTech and Ecole Nationale des Chartes discover historical watermarks using Valeo AI recognition technology.

Conference: Collections as Data at the Library of Congress

Above: Three project presentations of how historical archives can be used as data for the future preservation of materials, and the methods these researchers used.

References

Cordell, R. (2020, July 14). Machine learning + libraries: A report on the state of the field. LC Labs, Library of Congress.

Cordell, R., & Mullen, A. (2017). “Fugitive verses”: The circulation of poems in nineteenth-century American newspapers. American Periodicals, 27(1), 29-52.

Shoemaker, T. (2019). Error aligned. Textual Cultures: Texts, Contexts, Interpretation, 12(1), 155-182. doi:10.14434/textual.v12i1.27153

Smith, D. A., Cordell, R., & Dillon, E. M. (2013). Infectious texts: Modeling text reuse in nineteenth-century newspapers. 2013 IEEE International Conference on Big Data, 86-94. https://doi.org/10.1109/BigData.2013.6691675

Smith, D. A., Cordell, R., & Mullen, A. (2015). Computational methods for uncovering reprinted texts in antebellum newspapers. American Literary History, 27(3), E1-E15. doi:10.1093/alh/ajv029
