The 5th Conference on CMC and Social Media Corpora for the Humanities will be held in Bolzano/Bozen, Italy on 3-4 October 2017. Please check the call for papers:

Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities

We proudly present the Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, which was held on September 27-28, 2016 at the University of Ljubljana. The conference featured papers and presentations from 40 authors and co-authors from 24 research institutions in 11 countries. The complete proceedings are available open access:

Fišer, Darja; Beißwenger, Michael (eds., 2016):
Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities (cmc-corpora2016). University of Ljubljana.

Proceedings of the EmpiriST shared task on automatic processing of German CMC and web corpora

The results of the EmpiriST 2015 shared task on tokenization and part-of-speech tagging of German CMC and web corpora were presented at the 10th Web as Corpus Workshop (WAC-X) at ACL 2016. The concept and results of the task as well as the participating systems are described in the following volume:

Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task. Stroudsburg: Association for Computational Linguistics, 2016 (ACL Anthology W16-26).

CfP: NLP4CMC2016: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media

NLP4CMC 2016: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media

Workshop at KONVENS 2016, Bochum/Germany, September 22, 2016


Over the past decade, there has been a growing interest in collecting, processing and analyzing data from genres of social media and computer-mediated communication (CMC): As part of large corpora which have been automatically crawled from the web, CMC data are often regarded as an unloved “bycatch” which is difficult to handle with NLP tools that have been optimized for processing edited text; on the other hand, the existence of CMC data in web corpora is relevant for all research and application contexts which require data sets that represent the full diversity of genres and linguistic variation on the web. For corpus-based variational linguistics, CMC corpora are an important resource for closing the "CMC gap" both in corpora of contemporary written language and in corpora of spoken language: Since CMC and social media make up an important part of contemporary everyday communication, investigations into language change and linguistic variation need to be able to include CMC and social media data in their empirical analyses. Nevertheless, the development of approaches and tools for processing the linguistic and structural peculiarities of CMC genres and for building CMC corpora is lagging behind the interest in working with these types of data in the fields of language technology, corpus-based linguistics and web mining.

The goal of the NLP4CMC workshops, which are organized by the GSCL special interest group "Social Media / Computer-Mediated Communication", is to provide a platform for the presentation of results and the discussion of ongoing work in adapting NLP tools for processing CMC data and in using NLP solutions for building and annotating social media corpora. The main focus of the workshops is on German data, but submissions on NLP approaches, annotation experiments and CMC corpus projects for data of other European languages are also welcome. The 1st NLP4CMC workshop was held in September 2014 at KONVENS at the University of Hildesheim. The 2nd NLP4CMC workshop was held in September 2015 at the international conference of the German Society for Language Technology and Computational Linguistics (GSCL) at the University of Duisburg-Essen. The papers from both workshops have been published online.


We encourage the submission of research papers on best practices in building, annotating and processing corpora and lexical semantic resources for the analysis of social media / computer-mediated communication (CMC), including, but not restricted to, the following topics:

  • Collection, representation, maintenance and computer-assisted/automatic analysis of CMC and social media resources
  • Normalization (spelling correction, ...)
  • Automatic preprocessing (tokenization, POS tagging, lemmatization, parsing, word sense disambiguation)
  • Annotation of linguistic and structural features in social media / CMC data (annotation schemas, annotation experiments, metadata ...)
  • Domain adaptation
  • Automatic methods in corpus-based CMC / social media analysis (sentiment analysis, summarization, topic detection, trend detection, ...)
  • Big-data social media analysis

Besides individual papers, the workshop program will include a round-table discussion with participants from the GSCL Shared Task on Automatic Linguistic Annotation of CMC / Social Media Corpora (EmpiriST2015), who will present and discuss results from the project and future perspectives for adapting NLP systems to CMC and social media data.


  • Submissions due: 30 June 2016
  • Notification (reviews due): 31 July 2016
  • Camera-ready papers (revised versions) due: 22 August 2016
  • Workshop: 22 September 2016


Submissions should include the names and addresses of all authors and meet the following requirements:


  • Sabine Bartsch, TU Darmstadt
  • Stefanie Dipper, Ruhr University Bochum
  • Stefan Evert, University of Erlangen-Nürnberg
  • Iris Hendrickx, Radboud University Nijmegen
  • Verena Henrich, University of Tübingen
  • Axel Herold, Berlin-Brandenburg Academy of Sciences (BBAW), Berlin
  • Andrea Horbach, University of Saarbrücken
  • Tobias Horsmann, University of Duisburg-Essen
  • Anke Lüdeling, Humboldt University Berlin
  • Harald Lüngen, Institute for the German Language (IDS), Mannheim
  • Preslav Nakov, Qatar QCRI
  • Ines Rehbein, University of Potsdam
  • Roman Schneider, Institute for the German Language (IDS), Mannheim
  • Egon W. Stemle, EURAC, Bozen
  • Angelika Storrer, University of Mannheim
  • Simone Ueberwasser, University of Zürich
  • Kay-Michael Würzner, Berlin-Brandenburg Academy of Sciences (BBAW), Berlin

(more to be announced)


  • Michael Beißwenger (University of Duisburg-Essen, German Linguistics)
  • Michael Wojatzki (University of Duisburg-Essen, Language Technology Lab)
  • Torsten Zesch (University of Duisburg-Essen, Language Technology Lab)

The workshop is organized by the special interest group "Social Media /
Computer-Mediated Communication" of the German Society for Computational
Linguistics & Language Technology (GSCL).

CfP: cmc-corpora 2016 (Ljubljana)

The wait is over:
We proudly present the CfP for the 2016 edition of the cmc-corpora conference:

Call for Papers: 4th Conference on CMC and Social Media Corpora for the Humanities, 27-28 September 2016, Ljubljana, Slovenia

Call for Participation: Shared Task on Processing German CMC/Social Media & Web Data

The EmpiriST 2015 shared task aims to encourage the developers of NLP applications to adapt their tools and resources for the processing of written German discourse in genres of computer-mediated communication (CMC) - such as chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues - as well as monological web pages - such as personal or professional blogs, Wikipedia articles, academic sites, etc.

The shared task is divided into two subtasks (A: tokenization, B: POS tagging) and two different data sets (CMC subset, web corpora subset). While our main goal is to foster the development of robust tools that work well on a wide range of CMC & web genres, teams are allowed to focus on one subtask or one subset only. Full manually annotated training data are available now on the EmpiriST homepage, comprising approx. 5000 tokens for each subset.

Results and system descriptions will be presented in the WAC-X workshop co-located with ACL 2016 in Berlin, Germany (11 or 12 August 2016).

For more information, including detailed annotation guidelines and instructions for participation, see the EmpiriST homepage at

and join our Google group for updates, questions and discussion:

While EmpiriST is focussed on the annotation of German-language data, familiarity with German is not essential for participating in the task. There are sufficient amounts of training data for general machine learning, domain adaptation and optimization approaches. We also provide an English summary of the POS tagset and annotation guidelines.


20.12.2015        Release of the training data
31.01.2016        Team registration
15.02.2016        Release of the evaluation data for the tokenization subtask
19.02.2016        Submission deadline for the tokenization subtask
22.02.2016        Release of the evaluation data for the POS-tagging subtask
26.02.2016        Submission deadline for the POS-tagging subtask
ca. April 2016        Submission of system description papers
11/12.08.2016        Presentation of systems and task results at WAC-X workshop (ACL 2016, Berlin)


CMC data set:
 * Michael Beißwenger (Technische Universität Dortmund)
 * Kay-Michael Würzner (Berlin-Brandenburgische Akademie der Wissenschaften)

Web corpora data set:
 * Sabine Bartsch (Technische Universität Darmstadt)
 * Stefan Evert (Universität Erlangen-Nürnberg)

Contact address:

CfP: Resources, tools and methods for analysing CMC

Call for papers for a special issue of the open-access journal Slovenščina 2.0: "Resources, tools and methods for analysing computer-mediated communication"

You are kindly invited to submit your paper for a special issue of the open-access journal Slovenščina 2.0 on resources, tools and methods for analysing computer-mediated communication that will be published in August 2016. We welcome papers reporting on novel and completed research as well as survey papers, position papers, review papers and project reports.

The topics include but are not limited to:

  • construction and distribution of CMC corpora
  • tools and resources for processing of CMC
  • corpus analyses of CMC
  • comparisons of CMC with standard and/or spoken discourse
  • sociolinguistic studies of CMC
  • code-switching in CMC
  • neologism and semantic shift detection in CMC
  • offensive language in CMC

We welcome manuscripts written in English and Slovene. Please follow the instructions for manuscript submission:

Important dates:

  • 31. 03. 2016 – Submission of manuscripts
  • 31. 05. 2016 – Notification of acceptance/rejection
  • 31. 06. 2016 – Submission of final versions
  • 15. 07. 2016 – Formatting of the special issue
  • 30. 07. 2016 – Submission of proofs
  • 15. 08. 2016 – Publication of the special issue

Editors-in-chief:

  • Nataša Logar (UL FDV)
  • Polona Gantar (UL FF)

Guest editor:

  • Darja Fišer (UL FF)

Reviewers:

  • Michael Beißwenger (TUD)
  • Helena Dobrovoljc (ZRC SAZU and FH UNG)
  • Vojko Gorjanc (UL FF)
  • Axel Herold (BBAW)
  • Simon Krek (IJS)
  • Lothar Lemnitzer (BBAW)
  • Harald Lüngen (IDS)
  • Dunja Mladenić (IJS)
  • Marko Stabej (UL FF)
  • Egon W. Stemle (EURAC)
  • Marko Robnik Šikonja (UL FRI)
  • Darinka Verdonik (UM FERI)
  • Ciara R. Wigham (ICAR)

Conference: Corpus-linguistic and NLP aspects of Internet-based Slovene

The members of the JANIES project organized a conference on corpus-linguistic and NLP aspects of CMC discourse (Slovenščina na spletu in v novih medijih, University of Ljubljana, 25–27 November 2015).
Documentation of the conference can be found on the conference website:

Annotation schemas for CMC and social media genres: resources from the TEI special interest group on CMC

The special interest group "computer-mediated communication" of the Text Encoding Initiative (TEI) provides schemas for the annotation of CMC and social media genres which conform to the TEI standards for text encoding and which have been tested in corpus projects on French and German CMC:

Wiki pages of the special interest group:
Website of the Text Encoding Initiative:

The schemas were presented and discussed as part of the 2015 cmc-corpora conference in Rennes/FR.

