Digital Editions for Corpus Linguistics: Representing manuscript reality in electronic corpora

From Corpora: Pragmatics and Discourse. Papers from the 29th International Conference on English Language Research on Computerized Corpora (ICAME 29), Ascona, Switzerland, 14-18 May 2008, ed. by Andreas H. Jucker, Daniel Schreier & Marianne Hundt. Amsterdam & New York: Rodopi.

Alpo Honkapohja, Samuli Kaislaniemi & Ville Marttila

University of Helsinki

This paper introduces a new project, Digital Editions for Corpus Linguistics (DECL), which aims to create a framework for producing online editions of historical manuscripts suitable for both corpus linguistic and historical research. Up to now, few digital editions of historical texts have been designed with corpus linguistics in mind. Equally, few historical corpora have been compiled from original manuscripts. By combining the approaches of manuscript studies and corpus linguistics, DECL seeks to enable editors of historical manuscripts to create editions which also constitute corpora.

The DECL framework will consist of encoding guidelines compliant with the TEI XML standard, together with tools based on existing open source models and software projects. DECL editions will contain diplomatic transcriptions of the manuscripts, into which linguistic, palaeographic and codicological features will be encoded. Additional layers of contextual, codicological and linguistic annotation can be added freely to the editions using standoff XML tagging.
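The separation between a base transcription and freely added annotation layers can be illustrated with a minimal sketch. It is not the DECL encoding itself: the TEI `w` element and `xml:id` attribute are standard, but the `annotations` and `ann` elements, the `pos` attribute, the sample words and the `merge_layers` function are hypothetical names chosen for illustration only. The sketch assumes that every word in the transcription carries an identifier to which a separate standoff layer can point.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical base transcription: each word is a TEI <w> with an xml:id.
BASE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p>
      <w xml:id="w1">Take</w>
      <w xml:id="w2">the</w>
      <w xml:id="w3">rote</w>
    </p>
  </body>
</text>"""

# Hypothetical standoff layer, stored apart from the transcription.
# The element and attribute names here are assumptions, not TEI or DECL.
STANDOFF = """<annotations>
  <ann target="#w1" pos="VERB"/>
  <ann target="#w2" pos="DET"/>
  <ann target="#w3" pos="NOUN"/>
</annotations>"""


def merge_layers(base_xml, standoff_xml):
    """Pair each transcribed word with the annotation that targets its id."""
    base = ET.fromstring(base_xml)
    layer = ET.fromstring(standoff_xml)

    # Map "w1" -> "VERB", stripping the leading "#" of the pointer.
    pos_by_id = {
        ann.get("target").lstrip("#"): ann.get("pos")
        for ann in layer.iter("ann")
    }

    # Walk the words in document order; unannotated words get None.
    result = []
    for word in base.iter(f"{{{TEI_NS}}}w"):
        word_id = word.get(f"{{{XML_NS}}}id")
        result.append((word.text, pos_by_id.get(word_id)))
    return result


if __name__ == "__main__":
    for token, pos in merge_layers(BASE, STANDOFF):
        print(token, pos)
```

Because the annotation layer only points at identifiers, further layers can be added or replaced without touching the transcription itself, which is the property the standoff approach relies on.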

The paper first introduces the theoretical and research-ideological background of the DECL project, and then proceeds to discuss some of the limitations and problems of traditional digital editions and historical corpora. The solutions to these problems offered by DECL are then introduced, with reference to other projects offering similar solutions. Finally, the goals of the project are placed in the wider context of current trends in digital editing and corpus compilation.