Poor Access To Digitised Historical Texts: The Solutions of the IMPACT Project
Hildelies Balk, Programme Manager, IMPACT, European Union project for mass digitisation of printed European culture
Demand for digitally available material is increasing (text that is not digital is becoming virtually invisible), yet digitised material is becoming available too slowly and in too small quantities. Even when material has been digitised, OCR (optical character recognition) technology often fails to produce satisfactory results, especially for historical documents. The causes include historic fonts, complex layouts, ink showing through from the reverse of the page, and historical spelling variants. After a short overview of how much noise there actually is in certain texts (depending on century, type of source, etc.), we will describe the solutions currently being developed by the IMPACT project. Special attention will be devoted to the lexical resources: these can both improve OCR results by decreasing the number of errors and bridge the historical language barrier by enabling the user to retrieve historical spelling variants while searching for the modern word form.
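To make the last point concrete, the following is a minimal sketch of how a lexicon could let a search for a modern word form also retrieve historical spellings. It is an illustration under stated assumptions, not the IMPACT implementation: the lexicon entries, the function names `expand_query` and `search`, and the whitespace tokenisation are all hypothetical.

```python
# Toy lexicon: modern word form -> historical spellings.
# The entries are illustrative only and are not taken from the IMPACT lexica.
LEXICON = {
    "music": {"musick", "musicke"},
    "public": {"publick", "publique"},
}


def expand_query(term, lexicon=LEXICON):
    """Return the modern term plus any historical variants, sorted."""
    key = term.lower()
    return sorted({key} | lexicon.get(key, set()))


def search(term, documents, lexicon=LEXICON):
    """Return the ids of documents containing the term or one of its variants.

    Assumes `documents` maps an id to plain text and that splitting on
    whitespace is an adequate tokenisation for the example.
    """
    wanted = set(expand_query(term, lexicon))
    return [
        doc_id
        for doc_id, text in documents.items()
        if wanted & set(text.lower().split())
    ]


if __name__ == "__main__":
    docs = {
        "a": "The Musick of the spheres",
        "b": "modern music theory",
        "c": "a publick house",
    }
    print(search("music", docs))
```

A real system would need proper tokenisation and a lexicon built from attested historical sources; the sketch only shows the query-expansion step.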
About Hildelies Balk
Hildelies Balk has worked in the field of cultural heritage for over 20 years as a researcher and manager. In 2006 she joined the National Library of the Netherlands (KB) as Head of the National Programmes for digitisation. She coordinated the formation of the consortium and the submission of the proposal for IMPACT and now acts as coordinator for this project. IMPACT is a European project that aims to speed up the process and enhance the quality of mass digitisation in Europe. The IMPACT research programme will significantly improve digital access to historical printed text through the development and use of innovative Optical Character Recognition software and linguistic technologies.