Annotation Guidelines
The gold-standard training data have been manually tokenized and tagged according to the following guidelines:
Beißwenger, Michael; Bartz, Thomas; Storrer, Angelika; Westpfahl, Swantje (2015): Tagset und Richtlinie für das Part-of-Speech-Tagging von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline document from the Empirikom shared task on automatic linguistic annotation of internet-based communication (EmpiriST 2015). (21 pages).
PDF (German): EmpiriST_Guideline-PoS.pdf
PDF (English): EmpiriST_guideline-PoS(english).pdf (translated by Sabine Bartsch)
Beißwenger, Michael; Bartsch, Sabine; Evert, Stefan; Würzner, Kay-Michael (2015): Richtlinie für die manuelle Tokenisierung von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline document from the Empirikom shared task on automatic linguistic annotation of internet-based communication (EmpiriST 2015). (29 pages).
PDF (German): EmpiriST_Guideline-Tokenisierung.pdf
Ergänzungsdokument zu den Annotationsrichtlinien: Additional instructions and examples for selected PoS categories and tricky phenomena in CMC and social media data.
Online document (German): Google document
When citing these documents, please use the bibliographic information given above and refer to the URL http://sites.google.com/site/empirist2015/.
Note that our PoS tagging guideline extends and modifies the standard STTS (1999) tagset and should be read in combination with the STTS guidelines:
Schiller, Anne; Teufel, Simone; Stöckert, Christine; Thielen, Christine (1999): Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report, IMS, University of Stuttgart and SfS, University of Tübingen.
PDF: http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf
English description of the tagset: http://nachhalt.sfb632.uni-potsdam.de/owl-docu/stts.html
Overview: The part-of-speech tagset used for annotation
Extensions to STTS (1999) are highlighted with blue background colour: