Integrating a new type of language resource into the Digital Humanities landscape: French-German colloquium on standards for corpora of computer-mediated communication

University of Duisburg-Essen, June 19-20, 2017

Concept & goals

The challenge: standards for a new type of language resource

“Communication that takes place between human beings via the instrumentality of computers” (Herring, 1996:1), termed Computer-Mediated Communication (CMC), has become pervasive in today’s society. It encompasses many communication technologies, such as social networks, tweets, SMS, and WhatsApp, as well as specialized forums, chats, weblogs, and wikis on a wide variety of topics, and it affects people of different generations, cultures, and social classes. Language and interaction mediated via these technologies are now analysed for a wide range of purposes in both science and society. These include research to understand the impact of CMC on language change and on the writing skills of adolescents, to explore the use of CMC technologies as a means of social participation, and to explore phenomena of language contact in CMC environments.

Despite the growing interest in CMC and social media data in the Humanities, in Natural Language Processing (NLP), and in NLP-intensive business enterprises that are developing approaches and tools for web mining, information retrieval, and opinion and trend detection, very few collections of CMC data (CMC corpora) are available as resources that can be used for research and development without restrictions. A major reason for this is the lack of acknowledged standards for creating and representing this type of corpus. The colloquium will contribute to the creation of relevant standards by elaborating specifications for state-of-the-art CMC corpora together with representatives of different stakeholder groups (corpus providers, CMC researchers, experts in language resource infrastructures and Digital Humanities, NLP researchers and developers).

Previous work and state-of-the-art

In recent years, there has been increasing awareness of the need to close the ‘CMC gap’ in the corpora landscape. A growing number of projects are currently creating CMC corpora for a broad range of CMC genres (e.g., chats, blog comments, tweets, Usenet discussions, Wikipedia talk pages, SMS, WhatsApp interactions, interaction in multimodal online environments) and languages.

Informal exchanges between French and German projects (CoMeRe, Empirikom) led to the grassroots creation of the international conference series CMCCORPORA and to the establishment of a special interest group on CMC in the Text Encoding Initiative (TEI-CMC SIG), which works on suggestions for a TEI standard for the representation of CMC genres. Even though many issues remain open, previous and ongoing projects have created a solid pool of experience and best practices in dealing with the structural and linguistic peculiarities of CMC discourse when building and annotating corpora (Fišer/Beißwenger 2016; Beißwenger et al., 2017 forthc.). This pool provides a promising basis for creating standards for CMC corpora in a bottom-up approach.

Objectives to advance the state-of-the-art: interoperability and standards

The experience collected in a range of national projects underlines the increasing need to develop standards for the collection, representation, linguistic annotation, and provision of CMC corpora at a European level. The use of common basic procedures and schemas for the composition, representation, annotation, and provision of CMC corpora in different languages and projects will allow for the combined use and exploitation of (a) CMC resources for different languages and genres provided by different corpus providers, and (b) CMC corpora and language resources of other types (e.g. text corpora, speech corpora, historical corpora, learner corpora). This will facilitate both comparative research on CMC phenomena in different languages and CMC genres, and research comparing language use in CMC with spoken, dialogic language and with written, monologic language. Furthermore, it will allow CMC corpora to be integrated into existing language resource infrastructures such as those provided by CLARIN, DARIAH, and ORTOLANG.

The standards envisioned by the organizers of the colloquium shall allow CMC corpora to be:

available as open access resources and as part of acknowledged pan-European infrastructures;
designed as sustainable and reusable resources through the use of open, non-proprietary encoding, exchange and metadata standards;
interoperable through compliance with acknowledged standards in the field of Digital Humanities;
linguistically annotated by applying NLP tools that have been adapted to the structural and linguistic peculiarities of CMC discourse.

Goals and anticipated results of the colloquium

The objective of the colloquium is to survey the state-of-the-art for representing and annotating corpora of computer-mediated communication (CMC corpora) in the Humanities, and to identify the key issues that have to be solved as a prerequisite for combining, connecting, and merging CMC corpora for different languages and genres with each other and with corpora of other types (text corpora, spoken language corpora).

The result of the colloquium will serve as a specification for further work on DH standards on fundamental aspects of corpus creation (retrieval and representation of metadata, structural representation of CMC genres, linguistic annotation, provision as part of language resource infrastructures). To reach its objective, the colloquium brings together not only creators of CMC corpora and scholars interested in corpus-based CMC research, but also representatives of language resource infrastructure projects (CLARIN-D and ORTOLANG) as well as of large, existing collections of text and speech corpora (IDS Mannheim, Berlin-Brandenburg Academy of Sciences, CLAPI-ICAR Laboratory, Corpus de la parole - Ministère de la culture et de la communication). For the specification of requirements for the linguistic processing of CMC data, the colloquium includes researchers from the field of natural language processing (NLP). The expertise and resources of the participants form a pool of best practices upon which to build.

The interdisciplinary composition of the participants is intended to ensure that the results of the colloquium have an impact in different disciplines. Results of the colloquium will be made available on the web and will be presented at the 5th Conference on CMC and Social Media Corpora for the Humanities in Bolzano, Italy, in October 2017. They will serve as input for ongoing and future CMC corpus projects in different languages.

References

Beißwenger, Michael; Chanier, Thierry; Erjavec, Tomaž; Fišer, Darja; Herold, Axel; Ljubešić, Nikola; Lüngen, Harald; Poudat, Céline; Stemle, Egon; Storrer, Angelika; Wigham, Ciara (2017, forthc.): Closing a Gap in the Language Resources Landscape: Groundwork and Best Practices from Projects on Computer-mediated Communication in four European Countries. In: Selected Papers from the CLARIN Annual Conference 2016, October 26–28, 2016, Aix-en-Provence, France. Linköping University Electronic Conference Proceedings.
Fišer, Darja; Beißwenger, Michael (Eds., 2016): Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities (cmc-corpora2016). Ljubljana, Slovenia, 27-28 September 2016. http://nl.ijs.si/janes/cmc-corpora2016/proceedings/
Herring, Susan C. (Ed., 1996): Computer-Mediated Communication: Linguistic, Social and Cross-Cultural Perspectives. Amsterdam/Philadelphia: John Benjamins.
