Public Studies

Download public TPR-DB data

The CRITT TPR-DB consists of studies that were conducted with Translog-II, Trados, or the CASMACAT workbench. Compiled summary tables can be downloaded via the CRITT TPR-DB management tool:

1. go to login page: https://critt.as.kent.edu/cgi-bin/yawat/yawat.cgi
2. log in as public user (user: TPRDB, password: tprdb)
3. change to management page: https://critt.as.kent.edu/cgi-bin/yawat/tpd.cgi
4. click on the Download buttons for Tables, Alignment or Translog-II (raw logging data for all data acquisition tools) of the studies that you are interested in.

The raw logging and aligned data for all public sessions are available on SourceForge (https://sourceforge.net/projects/tprdb/) and can be checked out via svn:

svn checkout --username=tprdb https://svn.code.sf.net/p/tprdb/svn/tprdb-svn
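
The checkout above can also be scripted. The following is a minimal Python sketch, not an official CRITT tool: it assumes the `svn` command-line client is installed, and the function names, the `target_dir` argument, and the `dry_run` flag are introduced here for illustration only. By default it only prints the command, so nothing is downloaded until `dry_run=False` is passed.

```python
import shlex
import subprocess

# Repository URL and public user name as given in the svn command above.
REPO_URL = "https://svn.code.sf.net/p/tprdb/svn/tprdb-svn"
PUBLIC_USER = "tprdb"


def build_checkout_command(target_dir="tprdb-svn", username=PUBLIC_USER):
    """Build the svn checkout command as an argument list (hypothetical helper)."""
    return ["svn", "checkout", f"--username={username}", REPO_URL, target_dir]


def checkout(target_dir="tprdb-svn", dry_run=True):
    """Print the command (dry run) or run it with the local svn client."""
    cmd = build_checkout_command(target_dir)
    if dry_run:
        # Show what would be executed without touching the network.
        print(shlex.join(cmd))
        return cmd
    # Raises CalledProcessError if svn exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    checkout(dry_run=True)
```

Since the repository holds raw logging and aligned data for all public sessions, the full checkout may be large; the dry run is a cheap way to verify the command first.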

Post your technical, methodological, and theoretical questions and comments here.

Visualize Translation Progression Graphs

1. go to http://critt.as.kent.edu:3838/mcarl6/ProgGraph/
2. select the study and session you want to visualize
3. click on "Load selected session"
4. adjust the time and text segment for which you want to inspect behavioral data
5. optionally, visualize different types of units that fragment the stream of data, and/or look at (some of the) TPR-DB tables for the selected segment

Translation Progression Graphs for public studies can be interactively scrutinized by following this link:

http://critt.as.kent.edu:3838/mcarl6/ProgGraph/
Note: the link works with http, not with https. It works with the Microsoft Edge browser.

Translog Data

The studies below were conducted with Translog-II. The study names link to the raw data; primary references that make use of each resource are given in brackets. The references should be spelled out on the CRITT publication website.

The TPR-DB contains several collections of studies which make use of the same English source texts. These source texts are identically tokenized and segmented across all studies, but are used in different text production (or reception) modes and for text production into different languages.

multiLing

The multiLing data set is based on six English source texts which are translated into various languages. Four of them (Texts 1-4) are news articles and the other two (Texts 5-6) are sociological texts from an encyclopedia. The data can be downloaded from here (user: TPRDB, password: tprdb). The source text data (project files for Translog-II, tokenized *src files, and texts) can be downloaded from this link. Publications that refer to the data are given in brackets.

Into Arabic

- AR19: from-scratch translating (T), post-editing (P), and sight translation (S)
- ARMT19: alignment of 12 different MT systems (see also ESMT19, JAMT19, ZHMT19) (Ogawa et al 2021)

Into Chinese

- HNUJml: from-scratch translating (T) and post-editing (P), only texts 1 and 2 (Jia et al., 2019)
- MS12: post-editing (P) and editing (E) (many sessions missing)
- RUC17: from-scratch translating (T) and post-editing (P), 21 MTI students
- RUCMT17: Google MT error annotation by 16 translation students (Carl and Báez, 2019)
- STC17: from-scratch translation (T), post-editing (P), sight translation (ST), 16 MTI students
- STC17bolt: the STC17 study, re-aligned according to the BOLT guidelines
- STML18: sight translation (ST), 9 professional interpreters
- STML18bolt: the STML18 study, re-aligned according to the BOLT guidelines
- ZHMT19: alignment of 12 different MT systems (See also ARMT19, ESMT19, JAMT19)

Into Danish

- KTHJ08: from-scratch translation (T), 12 students and 12 professionals, only news texts 1-3 (Hvelplund 2011)

Into Dutch

- ENDU20: ten native Dutch translators with a recent master's degree, only keyboard logging with Translog-II (Vanroy, 2021)
- ENDU20-MT: two Dutch MT outputs from 2020, DeepL (P20) and Google Translate (P21) (Vanroy, 2021)

Into English

- TDA14: copying (C), 16 German translation students (Bangalore et al, 2015).
- CS19: copying (C), paraphrasing (H), summarization (U), 13 computer science students (Sahoo & Carl 2019)
- PARAP: copying (C), paraphrasing (H), computer science students
- BACK2020: back-translation by PhD students, 9 via Arabic and 4 via Chinese (P08, P11, P13, P15). P03 and P15 are MT. Source texts are substituted, which leads to an ST gaze mismatch (Saeedi, 2021).

Into German

- SG12: from-scratch translating (T), post-editing (P) and editing (E), 31 translation students (Nitzke 2019, Carl et al 2015)
- ADU17: revision (R), 39 translation students (Schaeffer et al 2019)
- SJM16: reading (L) and copying (C), 50 German translation students (Schaeffer and Carl, 2017)

Into Hindi

- NJ12: from-scratch translating (T), post-editing (P) and editing (E) by professional translators
- WARDHA13: from-scratch translating (T), post-editing (P) and editing (E) by students (many bugs)

Into Japanese

- ENJA15: translation dictation (D), post-editing (P), translation (T), 39 participants (Carl et al, 2016, Ogawa 2021)
- JAMT19: alignment of 13 different MT systems (See ARMT19, ESMT19, ZHMT19) (Ogawa et al 2021)

Into Spanish

- BML12: from-scratch translating (T), post-editing (P) and editing (E) (Mesa-Lao, 2012, Mesa-Lao, 2013)
- BML12_re: manually re-aligned BML12 study following consistent guidelines (Gilbert et al 2022)
  There are also various versions automatically re-aligned with SimAlign (https://arxiv.org/abs/2004.08728), using 3 different alignment methods:
    - BML12_SI: itermax (I) alignment method (recommended)
    - BML12_SA: argmax (A) alignment method
    - BML12_SM: match (M) alignment method
    - original BML12 SMT output (Google, from 2012), Sim-aligned: BML12_MT_SI, BML12_MT_SA, BML12_MT_SM
    - 10 different (N)MT systems (from 2021), Sim-aligned: BML12_NTO_SI, BML12_NTO_SA, BML12_NTO_SM
- ESMT19: alignment of 9 different MT systems (from 2019, see also ARMT19, JAMT19, ZHMT19) (Gilbert 2021, Ogawa et al 2021)
- MPM16: MT error annotation by 8 professional translators (Carl and Báez, 2019)
- SPC15: from-scratch translating (T), texts 4 and 6

Into Spanish with permuted segments

- MP16: non-coherent post-editing, i.e. with segments in permuted order (Báez, Schaeffer and Carl, 2018)
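
The task codes given in brackets in the multiLing listings above can be collected into a small lookup table. The following Python sketch is only an illustration: the dictionary is compiled from the listings on this page, and the names `TASK_CODES` and `describe_tasks` are hypothetical, not part of the TPR-DB tooling. Note that sight translation appears as S in AR19 and as ST in STC17 and STML18, so both codes are included.

```python
# Task codes as they appear in the multiLing study listings on this page.
# This table is an illustrative summary, not an official TPR-DB resource.
TASK_CODES = {
    "T": "from-scratch translating",
    "P": "post-editing",
    "E": "editing",
    "S": "sight translation",
    "ST": "sight translation",
    "C": "copying",
    "H": "paraphrasing",
    "U": "summarization",
    "R": "revision",
    "L": "reading",
    "D": "translation dictation",
}


def describe_tasks(codes):
    """Return 'code: description' for each code; unknown codes are flagged."""
    return [
        f"{code}: {TASK_CODES.get(code.upper(), 'unknown task code')}"
        for code in codes
    ]


if __name__ == "__main__":
    # Example: the three tasks listed for the AR19 study.
    for line in describe_tasks(["T", "P", "S"]):
        print(line)
```

The table covers the multiLing listings only; other collections on this page use further codes (for example TA and TB in GV18), and ST is also used elsewhere as an abbreviation for source text.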

missionStatements

Thirteen English mission statements from different companies, each containing approximately 160-190 words. The data can be downloaded from here.

- HNUJms: English-Chinese; from-scratch translating (T) and post-editing (P), only texts 4, 5, 6 (Jia et al., 2019)
- JTD16: English-Japanese; translation dictation (D) (7 participants, translated 12 texts on 6 successive days)
- GV18: English-German; from-scratch translating with retrospective protocol (TA) and without (TB), 22 translation students
- ENTP19: copying (C), paraphrasing (H), 4 participants, first 6 texts

ministerSpeech

This is a 44:50-minute speech given by the Minister for Foreign Affairs of Australia during her visit to Japan in 2014. The speech, in English, was delivered at the National Press Club, Tokyo, Japan. The beginning of the speech is segmented into 6 short texts (approx. 1 min each). The source text data (Translog-II project files, videos, wav files, and some background information) can be downloaded from this link.

Into Chinese

- IMBi18: simultaneous interpretation by 9 professional interpreters with source text (ST), ear-voice alignment
- IMBi18bolt: the IMBi18 study, re-aligned according to the BOLT guidelines
- IMBst18: IMB study, simultaneous interpretation with source text (ST), eye-voice alignment
- IMBst18bolt: the IMBst18 study, re-aligned according to the BOLT guidelines
- XIANG19: from-scratch translation, 10 translation students

Into German

- ST19: copying (C), reading aloud (LV), translation (T), sight translation (S), 12 translation students

diverse

Studies that use different source texts:

Brazilian Portuguese, English

- JLG10: English <-> Brazilian Portuguese: L1 and L2 translations from/to English and Brazilian Portuguese.

Chinese, English

- CET6: Chinese-English: 22 Chinese BA students translating texts into their L2, with eye-tracking (Liu, 2018)
- JIN15: Data from Jin Huang's PhD dissertation (Huang, 2018)
- HNUJd: Data from Yangfang Jia's PhD dissertation (other texts; see HNUJml and HNUJms)
- LiTian2019New: English-to-Chinese, translation & copy 4 texts, 64 sessions (Sanjun Sun et al. 2021)
- MAecho2019: English-to-Chinese, translation & copy 128 sessions (Sanjun Sun et al. 2021)

Chinese, Portuguese

- ZHPT12: Chinese-Portuguese: translating (T) journalistic texts with non-literal (metaphoric) expressions.
- MS13: Brazilian Portuguese <-> Chinese: translating and post-editing in both language directions.

Danish, English

- ACS08: English-Danish: from-scratch translation (T) and copying (C) of non-literal expressions. (Sjørup, 2013)
- BD08: English-Danish: from-scratch translation, professional translators. (Dragsted, 2010)
- BD13: English-Danish: secondary school students translating and post-editing
- LWB09: Danish[L1]-English[L2]: eye tracking experiment with professional translators. (Jensen et al 2009)

English, Dutch (nl)

- predict20: English-Dutch: from-scratch translation (T) and copying (C) of non-literal expressions. (Vanroy, 2021)

Estonian, English

- HLR13: English-Estonian: 5 participants translating 3 different texts.

Polish, French

- DG01: Polish-French: professional and non-professional translators with and without a representation of the text (Płońska, 2015)

German, English

- AE17: English-German from-scratch translating (T), 12 students, 20 texts (Heilmann, )
- AU20: English-German from-scratch translating (T) of popular-scientific news texts in a translation lab, 11 professional translators (Heilmann et al. 2022)
- HE17: from-scratch translating (T) of popular-scientific news texts in a translation lab, 12 professional translators (Freiwald et al. 2020)
- CPH17: English-German from-scratch translating (T), 41 students

Spanish, English

- GS12: Spanish-English: post-editing of four pieces of news.
- R H12: authoring (A) study for the production of news by two Spanish journalists.

Trados Data

Sessions recorded with Trados. Follow the instructions here to download the data. For a description of the Trados logging tool and the conversion into TPR-DB, see Zou, L., et al. (2023), Yamada et al. (2022), or Zou, L., & Carl, M. (2022).

Data used in Vieira et al. (2023), Translating science fiction in a CAT tool: machine translation and segmentation settings. Translation & Interpreting, Vol. 15, No. 1:

- CREATIVE: English-Chinese literary translation vs. post-editing (13 professional translators)
- CREATIVE2: English-Chinese literary sentence post-editing vs. paragraph post-editing (11 professional translators)

Data used in Zou et al (2022):

- ATJA22: English-Japanese different translation briefs / expert translators (7 experienced translators)
- ATZH22: English-Chinese different translation briefs / novice translators (5 novice translators)

Data used in Gilbert (2022), PhD thesis:

- DG21: English-Spanish post-editing, 36 professional translators (subset of LS14), with automatically and manually highlighted MT errors (P, PA and PM, respectively)
- DG21error: manual error annotation of the post-edited versions

CASMACAT data

Studies conducted with the CASMACAT workbench are listed below. Follow the instructions here to download the data:

- ALG14: This study compares professional translators and bilinguals while post-editing with the third prototype of the CASMACAT workbench featuring visualization of word alignments. (Alabau et al, 2014)
- CEMPT13: This study contains post-editing recordings with the second prototype of the CASMACAT workbench, featuring interactive machine translation. (Alves et al, 2015)
- CFT12: This study contains data of the first CASMACAT field trial from June 2012, comparing post-editing with from-scratch translation. (Elming et al, 2014)
- CFT13: This study contains data of the second CASMACAT field trial from June 2013, comparing post-editing and interactive machine translation. (Carl et al. 2014)
- CFT14: This study contains data of the third CASMACAT field trial from June 2014, comparing interactive machine translation and online learning. (Alabau et al, 2014, Zapata 2015)
- EFT14: The study compares active and online learning during interactive translation prediction (Ortiz-Martínez et al 2015)
- JN13: This study is recorded with the second prototype of the CASMACAT workbench featuring interactive machine translation and word alignments. (Nitzke and Oster, 2015)
- LS14: This study investigates learning effects with interactive post-editing over a period of six weeks with the third prototype of the CASMACAT workbench. (Alabau et al, 2014, Alabau et al 2015)
- PFT13: This study is a pre-field trial test prior to the second CASMACAT field trial.
- PFT14: This study is a pre-field trial test prior to the third CASMACAT field trial.
- ROBOT14: This study investigates usage of external resources during translation and post-editing (Daems et al, 2015)

Data from the CASMACAT field trial 2014 (CFT14):

Seven post-editors, each post-editing two texts in plain post-editing mode and under active learning conditions. The data also includes revisions of the post-edited texts made with handwriting recognition.

- CFT14 raw logging data

Data from the CASMACAT longitudinal study 2014 (LS14):

Five post-editors each post-edited 24 files over a period of 6 weeks between May and June 2014, in plain post-editing and interactive mode. In all, there are 120 translation sessions, of which 35 have recorded gaze data.

- LS14 raw logging data

Data from the CASMACAT field trial 2013 (CFT13):

Logging data of 81 post-editing and revision sessions, more than 120 hours of user activity data, recorded with CASMACAT workbench v.2.0:

- CFT13 raw logging data
- CFT13 segments (v 1.4) - segment editing and revision information
  - For a description of the features see: Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao (2014) "CFT13: A new resource for research into the post-editing process", In Proceedings of LREC 2014
- CFT13 segments (v 1.3) - segment editing and revision information
- CFT13 TPR-DB (v1.2) - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
  - more recent versions are contained in the TPR-DB
- CFT13 videos - videos available upon request (mc.ibc@cbs.dk)

Data from the CASMACAT pre-field trial 2013 (PFT13):

This data was recorded with CASMACAT workbench v.2.0:

- PFT13 - raw data
- PFT13 - compiled into TPR-DB format

Data from the CASMACAT field trial 2012 (CFT12):

Logging data of 89 translation sessions English -> Spanish recorded with the CASMACAT workbench v.1.0

- CFT12 - raw logging data
- CFT12 - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
- CFT12 - link to the enriched data as prepared by the group in Wolverhampton.