GSCL Shared Task: Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media
Welcome to the website of the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media.
Latest news:
official results and complete gold standard have been released (see task description paper for details)
presentation slides with task overview; slides for the system descriptions are available from the 10th Web as Corpus Workshop (WAC-X)
The goal of this shared task is to encourage developers of NLP applications to adapt their tools and resources for the processing of written German discourse in genres of computer-mediated communication (CMC). Examples of CMC genres are chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.
Processing CMC discourse is a desideratum and a relevant task in different research fields and application contexts in the Digital Humanities, for example:
in the context of building, processing and analyzing corpora of computer-mediated communication / social media (chat corpora, news corpora, WhatsApp corpora, ...)
in the context of collecting, processing and analyzing large, genre-heterogeneous web corpora as resources in the field of Language Technology / Data Mining
in the context of dealing with CMC data in corpus-based analyses on contemporary written language, language variation and language change
in all research fields beyond linguistics which address social, cultural and educational aspects of social media and CMC technologies using language data from CMC genres
The shared task consists of two subtasks:
Tokenization of CMC discourse
Part-of-speech tagging of CMC discourse
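To illustrate why tokenization of CMC discourse needs adapted tools, the following is a minimal sketch of a rule-based tokenizer that keeps emoticons, hashtags, @-mentions, URLs and repeated punctuation together as single tokens. The patterns and the function name tokenize are assumptions made for this illustration only; they do not reproduce the EmpiriST annotation guidelines or any participating system.

```python
import re

# Hypothetical token patterns for illustration, tried from most to least specific.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                                # URLs
    | [@#]\w+                                   # addressing terms and hashtags
    | [:;=][-o*']?[)(\]\[dDpP/\\|](?!\w)        # simple emoticons, e.g. :-) ;) :D
    | \w+(?:[-']\w+)*                           # words, incl. hyphen or apostrophe
    | [.!?]{2,}                                 # runs of sentence punctuation
    | \S                                        # any other single character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split a CMC posting into a list of token strings."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    posting = "@anna kommst du morgen??? :-) #freu http://example.org/x"
    for token in tokenize(posting):
        print(token)
```

A whitespace- or punctuation-based tokenizer would split ":-)" into three tokens and separate "#" from "freu"; the ordered alternatives above avoid that for the cases they cover, but real CMC data contains many phenomena this sketch does not handle.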
The two subtasks will have to be handled for two different data sets:
CMC data set: a selection of data from different CMC genres (social chat, professional chat, Wikipedia talk pages, blog comments, tweets, WhatsApp dialogues).
Web corpora data set: a selection of data which represents written discourse from heterogeneous WWW genres. It consists of crawled websites including small portions of CMC discourse (e.g. webpages, blogs, news sites, blog commentary).
We will provide training data sets which have been manually tokenized and tagged on the basis of detailed annotation guidelines.
Before the release of the full task we will publish a small set of trial data which may be used by developers. The annotation guidelines which have been used for annotating the trial and training data are also available.
The shared task (ST) has been prepared by members of the DFG scientific network Empirikom (therefore: "EmpiriST"). Its preparation has been funded by the German Society for Language Technology and Computational Linguistics (GSCL).
The shared task is endorsed by the ACL Special Interest Group on the Web as Corpus and by the GSCL Special Interest Group on Social Media / Computer-Mediated Communication.