Public Studies

The CRITT TPR-DB consists of several studies that were conducted with Translog-II or with the CASMACAT workbench. Raw logging and aligned data for all public sessions are available on SourceForge (https://sourceforge.net/projects/tprdb/) and can be checked out via svn.

Post-processed and compiled versions of the data consist of several summary files per study. All summary tables are zipped and can be downloaded via the TPR-DB management tool. Some older versions are available here: https://sourceforge.net/projects/tprdb/files/
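
As a minimal sketch of how a downloaded archive of summary tables might be read, the following Python snippet assumes that each table inside the zip file is a tab-separated text file with a header row, and that the table type can be selected by file extension. Neither the archive layout nor any particular extension is confirmed by this page, and the function name `read_tables` is hypothetical; adjust both to the files you actually download.

```python
import csv
import io
import zipfile


def read_tables(zip_source, extension):
    """Return {member name: list of row dicts} for every archive member
    whose name ends with `extension`.

    Assumption: each table is a tab-separated text file with a header row.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(extension):
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # QUOTE_NONE keeps quotation marks inside tokens untouched.
                reader = csv.DictReader(
                    text, delimiter="\t", quoting=csv.QUOTE_NONE
                )
                tables[name] = list(reader)
    return tables
```

`zip_source` can be a path to the downloaded zip file or an open binary file object. All values are returned as strings, so numeric columns need to be converted before analysis.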

Post your technical, methodological, and theoretical questions and comments here.

Inspect (public) studies in the TPR-DB

You can inspect alignments of public studies in the TPR-DB.

Go to the page: http://dighum1.ftsk.uni-mainz.de/cgi-bin/yawat/yawat.cgi

Log in as:

user: TPRDB

password: tprdb

Click on any of the listed studies and then on any session. You can hover over the words to see the word alignments. In your own uploaded study (see below) you can also modify alignments: use the left mouse button to open and modify an alignment, use the right mouse button to close it, and click the "done" button to confirm the changes.

NOTE: Please use Firefox to work with YAWAT (avoid Internet Explorer or Chrome).

Reference:

Ulrich Germann. 2008. Yawat: Yet Another Word Alignment Tool. In Proceedings of the ACL-08: HLT Demo Session (Companion Volume), pages 20–23. Association for Computational Linguistics. http://www.aclweb.org/anthology/P08-4006

You can download the TPR-DB from the TPR-DB website

Another visualization of translation process data with R is shown on this page

Annotating alignment groups in YAWAT

In a personal account you can align translation sessions. ST words should be aligned with the TT words as completely and as compositionally as possible: try to align every single word and all punctuation marks, but create the smallest possible alignment groups. For example, if the source text says [Killer nurse] and the Spanish TT says [enfermero asesino], then align [Killer - asesino] and [nurse - enfermero], and not [Killer nurse - enfermero asesino]. To create and annotate an alignment group:

  1. Select the alignment group by left-clicking on the ST and TT elements to be aligned. For example, left-click on “killer” and then on “asesino”.
  2. To confirm that you have created an alignment group, right-click on one of the elements of the group, for example “asesino”. Both elements will form an alignment group and will be marked in grey.
  3. In order to annotate the group, right-click on one of the elements (for instance, “asesino”). If there is no error, left-click on the default label “Unspecified (no error)”. If there is an error in the aligned group, left-click on the error to be annotated from the 10 errors in the section “error codes”.
  4. If you would like to annotate an unaligned ST or TT word (mono-label), right-click on the unaligned word and select one of the two possible categories (Addition/Omission and Unintelligible).
  5. Once you are done with the annotation of a segment, click on “done” next to the segment number. Changes will be saved.
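
The alignment guideline above (align every token, keep groups as small as possible) can be illustrated with a small data structure. The sketch below is not part of YAWAT; the function `check_alignment` and the representation of a group as a pair of index sets are assumptions made for illustration. It reports tokens that are still unaligned and flags groups with several tokens on both sides, which may be worth splitting.

```python
def check_alignment(st_tokens, tt_tokens, groups):
    """groups: list of (st_indices, tt_indices) pairs, each a set of ints.

    Returns (unaligned_st, unaligned_tt, non_minimal), where non_minimal
    lists groups that contain several tokens on BOTH sides and may
    therefore be worth splitting into smaller groups.
    """
    aligned_st = set()
    aligned_tt = set()
    non_minimal = []
    for st_ids, tt_ids in groups:
        aligned_st |= set(st_ids)
        aligned_tt |= set(tt_ids)
        if len(st_ids) > 1 and len(tt_ids) > 1:
            non_minimal.append((sorted(st_ids), sorted(tt_ids)))
    unaligned_st = [t for i, t in enumerate(st_tokens) if i not in aligned_st]
    unaligned_tt = [t for i, t in enumerate(tt_tokens) if i not in aligned_tt]
    return unaligned_st, unaligned_tt, non_minimal


st = ["Killer", "nurse"]
tt = ["enfermero", "asesino"]

# Preferred: two small groups (Killer - asesino, nurse - enfermero).
good = [({0}, {1}), ({1}, {0})]
# Discouraged: one large group covering both words on each side.
coarse = [({0, 1}, {0, 1})]

print(check_alignment(st, tt, good))    # ([], [], [])
print(check_alignment(st, tt, coarse))  # ([], [], [([0, 1], [0, 1])])
```

A flagged group is only a hint: some expressions, such as idioms, cannot be aligned word by word, and a larger group is then the smallest possible alignment.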

Sometimes the automatic segment alignment is not correct. In that case, please follow the instructions here.

To get a personal Yawat account, please send an email to m.gummiball[at]gmail.com.

The study names link to the raw data; primary references that make use of each resource are given in brackets.

Studies that were conducted with Translog:

  • ACS08: This study explores the way in which translators process the meaning of non-literal expressions by investigating the gaze times associated with these expressions. (Sjørup, 2013)
  • BD08: This study involves Danish professional translators working from English into Danish. (Dragsted, 2010)
  • BD13: This study involves secondary school students translating and post-editing from English into Danish.
  • DG01: The study compares students, professional and non-professional translators with and without a representation of the text (Płońska, 2015)
  • GS12: This study contains post-editing data of four pieces of news from Spanish into English.
  • HLR13: This is a translation study from English into Estonian (5 participants translating 3 different texts).
  • JIN15: Data of Jin Huang's PhD thesis
  • JLG10: This study investigates L1 and L2 translations from/to English and Brazilian Portuguese.
  • JTD16: This study investigates learning effects in Japanese translation dictation (7 participants translated 12 texts on 6 successive days).
  • LWB09: This study reports on an eye tracking experiment in which professional translators were asked to translate two texts from L1 Danish into L2 English. (Jensen et al 2009)
  • MS13: This study is an investigation of translators’ behaviour when translating and post-editing Portuguese and Chinese in both language directions.
  • RH12: This is an authoring study for the production of news by two Spanish journalists.
  • ZHPT12: This study investigates translators’ behaviour when translating journalistic texts. The specific aim is to explore the translation process during the processing of non-literal (metaphoric) expressions.

multiLing

The aim of the multiLing data set is to compare from-scratch translation (T), post-editing (P) and monolingual post-editing (E), and more recently translation dictation (D) and spoken translation (S), for different translators and for different languages. The six English source texts are translated by student and experienced translators; four texts (1-4) are news texts, two texts (5-6) are sociological texts from an encyclopedia. Texts were permuted in a systematic manner so as to make sure that each text was translated by every translator and every translator translated two different texts in each translation mode.
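
One way to picture such a systematic permutation is a simple rotation. The sketch below is an illustration only and not the scheme actually used for the multiLing recordings: it assumes three modes (T, P, E), six texts, and a hypothetical function `assign_modes` that shifts the text-to-mode mapping by one position per translator.

```python
MODES = ["T", "P", "E"]   # from-scratch translation, post-editing, editing
N_TEXTS = 6


def assign_modes(translator_index):
    """Map text number (1-6) to a mode for one translator.

    Rotating by the translator index gives every translator exactly two
    texts per mode, and over six consecutive translators every text is
    seen twice in every mode.
    """
    plan = {}
    for text in range(N_TEXTS):
        slot = (text + translator_index) % N_TEXTS
        plan[text + 1] = MODES[slot // 2]
    return plan


for translator in range(3):
    print(translator, assign_modes(translator))
```

With this rotation, translator 0 translates texts 1-2, post-edits texts 3-4 and edits texts 5-6; each following translator starts one text later, so mode and text are balanced across a group of six translators.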

  • BML12: This study contains translating, post-editing and editing data of six texts from English into Spanish. (Mesa-Lao, 2012, Mesa-Lao, 2013)
  • ENJA15: The goal of this study was to compare English-to-Japanese translation dictation, post-editing and from-scratch translation (Carl et al, 2016)
  • KTHJ08: This study contains only translation data for the news texts 1-3. (Hvelplund 2011)
  • MS12: This study contains translating, post-editing and editing of the six texts from English into Chinese.
  • RUC17: This study contains translating and post-editing of the six texts from English into Chinese.
  • NJ12: This study contains translating, post-editing and editing of the six texts from English into Hindi by professional translators.
  • SG12: This study contains translating, post-editing and editing of the six texts from English into German. (Carl et al 2015)
  • TDA14: In this study participants were asked to copy the six English texts (Bangalore et al, 2015).
  • WARDHA13: This study contains translating, post-editing and editing of the six texts from English into Hindi by students.

Studies conducted with the CASMACAT workbenches include:

  • ALG14: This study compares professional translators and bilinguals while post-editing with the third prototype of the CASMACAT workbench featuring visualization of word alignments. (Alabau et al, 2014)
  • CEMPT13: This study contains post-editing recordings with the second prototype of the CASMACAT workbench, featuring interactive machine translation. (Alves et al, 2015)
  • CFT12: This study contains data of the first CASMACAT field trial from June 2012, comparing post-editing with from-scratch translation. (Elming et al, 2014)
  • CFT13: This study contains data of the second CASMACAT field trial from June 2013, comparing post-editing and interactive machine translation. (Carl et al. 2014)
  • CFT14: This study contains data of the third CASMACAT field trial from June 2014, comparing interactive machine translation and online learning. (Alabau et al, 2014, Zapata 2015)
  • EFT14: The study compares active and online learning during interactive translation prediction (Ortiz-Martínez et al 2015)
  • JN13: This study is recorded with the second prototype of the CASMACAT workbench featuring interactive machine translation and word alignments. (Nitzke and Oster, 2015)
  • LS14: This study investigates learning effects with interactive post-editing over a period of six weeks with the third prototype of the CASMACAT workbench. (Alabau et al, 2014, Alabau et al 2015)
  • PFT13: This study is a pre-field trial test prior to the second CASMACAT field trial.
  • PFT14: This study is a pre-field trial test prior to the third CASMACAT field trial.
  • ROBOT14: This study investigates usage of external resources during translation and post-editing (Daems et al, 2015)

Data from the CASMACAT field trial 2014 (CFT14):

Seven post-editors each post-edited two texts in plain post-editing mode and under active learning conditions. Revisions of the post-edited texts were carried out with handwriting recognition.

Data from the CASMACAT longitudinal study 2014 (LS14):

Five post-editors each post-edited 24 files during a period of 6 weeks between May and June 2014 in plain post-editing and interactive mode. In all, there are 120 translation sessions, of which 35 include recorded gaze data.

Data from the CASMACAT field trial 2013 (CFT13):

Logging data of 81 post-editing and revision sessions, more than 120 hours of user activity data, recorded with CASMACAT workbench v.2.0:

  • CFT13 raw logging data
  • CFT13 segments (v 1.4) - segment editing and revision information
    • For a description of the features see: Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao (2014) "CFT13: A new resource for research into the post-editing process", In Proceedings of LREC 2014
  • CFT13 segments (v 1.3) - segment editing and revision information
  • CFT13 TPR-DB (v1.2) - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
    • more recent versions are contained in the TPR-DB
  • CFT13 videos - videos available upon request (mc.ibc@cbs.dk)

Data from the CASMACAT pre-field trial 2013 (PFT13):

This data was recorded with CASMACAT workbench v.2.0.

Data from the CASMACAT field trial 2012 (CFT12):

Logging data of 89 translation sessions (English -> Spanish) recorded with the CASMACAT workbench v.1.0:

  • CFT12 - raw logging data
  • CFT12 - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
  • CFT12 - link to the enriched data as prepared by the group in Wolverhampton.

OVERVIEW OF CONTENTS OF STUDIES IN THE TPR-DB (V2.265)