About
Handwritten text recognition (HTR) is the conversion of scanned images of handwritten text into machine-encoded text. It is more challenging than optical character recognition (OCR), where the text to be transcribed is printed: HTR output can contain many more errors, and no transcription may be possible at all when training data for the specific script (e.g., medieval) are not available. Post-correcting HTR output by hand, in turn, has been reported as expensive, time-consuming, and challenging for the human expert. On Day 1, this workshop will present the outcomes of a challenge on the error correction of the HTR output of Greek papyri and Byzantine manuscripts. On Day 2, the focus will shift to the future of the challenge: extending the languages and genres studied and moving from West to East. The discussion will focus on Japanese and Chinese pre-modern materials.
The digitisation of historical materials (texts, images) is essential for the preservation, study and understanding of cultural heritage worldwide. However, handwritten text recognition (HTR) is challenging for East Asian historical materials produced in pre-modern China, Korea and Japan, especially manuscripts written in cursive script. Similar problems arise in the transcription of Japanese woodblock-printed materials, in which texts and image inscriptions are printed from single hand-carved woodblocks. Japan has a very long textual and printing history (the first printed materials date from the 8th century), which has resulted in an extremely rich corpus of handwritten and printed materials, including texts in diverse formats (e.g. books, leaflets) and image inscriptions (e.g. on maps and art images). The Union Catalogue of Early Japanese Books, kept by Japan's National Institute of Japanese Literature, alone holds around 500,000 books, while the Ukiyo-e Portal Database at the Art Research Center, Ritsumeikan University, Kyoto, hosts approximately 700,000 (678,429) printed images kept at institutions in Japan and abroad. The great majority of these texts were printed from hand-carved wooden printing blocks (not movable type), in which the texts were rendered in a cursive, free-floating writing style with great linguistic and stylistic diversity (showcasing diverse ‘hands’ in printed materials). In contrast to Europe, China and Korea, movable-type printing technology was not generally used in Japan until the second half of the 19th century (with some sporadic exceptions).
This material characteristic of East Asian textual heritage makes it difficult for contemporary researchers and the wider public, in East Asia and abroad, to access. The main reasons are that the skills needed to read pre-modern texts are not common and that transcription cannot yet be easily supported by computational technologies. Natural Language Processing (NLP) has the potential to improve this situation, as evidenced by work on Grammatical Error Correction (Wang et al. 2021) and by the reduction in character error rate achieved during HTREC 2022.
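For readers unfamiliar with the metric, the character error rate (CER) mentioned above can be sketched as follows. This is a minimal illustration, not the official HTREC scoring script: it assumes CER is the Levenshtein edit distance between a transcription and its reference, divided by the reference length, and that the gain from post-correction is the difference between the CER before and after correction. The function names and the NFC normalisation step are assumptions made for this example.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions turning `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` against `reference`."""
    # Assumption: normalise to NFC so that accented characters
    # (e.g. polytonic Greek) are compared as composed code points.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cer_reduction(reference: str, htr_output: str, corrected: str) -> float:
    """Positive when the corrected text is closer to the reference
    than the raw HTR output was."""
    return cer(reference, htr_output) - cer(reference, corrected)
```

Under these assumptions, a post-correction system is useful when `cer_reduction` is positive, and harmful when it is negative, i.e. when the "corrections" introduce more errors than they remove.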
The future challenge, HTREC 2023, offers an opportunity to broaden access to, and deepen understanding of, global (not only Western) cultural heritage. Besides making Japanese and Chinese printed materials more accessible and better understood today, error correction of HTR output for woodblock-printed historical texts could also serve as a key step towards more effective technologies for the automated transcription of East Asian historical manuscripts, which still largely await scholarly and public attention. The future challenge is also expected to improve error correction itself, by developing and testing NLP methods in a different linguistic environment, outside the Indo-European language family, with complex writing systems that combine logographic characters and two phonetic syllabaries.