Public Studies
Download public TPR-DB data
The CRITT TPR-DB consists of studies that were conducted with Translog-II, Trados or the CASMACAT workbench. Compiled summary tables can be downloaded via the CRITT TPR-DB management tool:
go to login page: https://critt.as.kent.edu/cgi-bin/yawat/yawat.cgi
log in as public user: TPRDB, password: tprdb
change to management page: https://critt.as.kent.edu/cgi-bin/yawat/tpd.cgi
click on the Download buttons for Tables, Alignment or Translog-II (raw logging data for all data acquisition tools) of the studies that you are interested in.
The raw logging and aligned data for all public sessions are available on SourceForge https://sourceforge.net/projects/tprdb/ and can be checked out via svn:
svn checkout --username=tprdb https://svn.code.sf.net/p/tprdb/svn/tprdb-svn
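For scripted downloads, the checkout command above can be wrapped in a few lines of Python. This is only a minimal sketch: it assumes the `svn` command-line client is installed and on the PATH, and the function names and the optional `dest` argument are illustrative, not part of any TPR-DB tooling.

```python
import subprocess

# URL and user name are copied from the checkout command above.
TPRDB_SVN_URL = "https://svn.code.sf.net/p/tprdb/svn/tprdb-svn"
TPRDB_SVN_USER = "tprdb"


def build_checkout_command(dest=None):
    """Return the svn checkout command as an argument list."""
    cmd = ["svn", "checkout", f"--username={TPRDB_SVN_USER}", TPRDB_SVN_URL]
    if dest is not None:
        # Optional local target directory (hypothetical, chosen by the caller).
        cmd.append(dest)
    return cmd


def checkout(dest=None):
    """Run the checkout; raises CalledProcessError if svn exits with an error."""
    subprocess.run(build_checkout_command(dest), check=True)
```

Calling `checkout()` starts the download into the current directory; the repository holds the raw logging and aligned data for all public sessions, so the checkout may take a while.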
Post your technical, methodological, and theoretical questions and comments here.
Visualize Translation Progression Graphs
Translation Progression Graphs for public studies can be interactively scrutinized by following this link:
note: this works with http, not with https. It works with the MS-Edge browser.
select the study and session you want to visualize
click on "Load selected session"
adjust the time and text segment for which you want to inspect behavioral data
you can also visualize different types of units to fragment the stream of data
and/or look at (some of) the TPR-DB tables for the selected segment
Studies that were conducted with Translog-II. The study names link to the raw data; primary references that make use of each resource are given in brackets. The references should be spelled out on the CRITT publication website.
The TPR-DB contains several collections of studies which make use of the same English source texts. These source texts are identically tokenized and segmented across all studies, but are used in different text production (or reception) modes and text production into different languages.
multiLing
The multiLing data set is based on six English source texts which are translated into various languages. Four of them (Texts 1-4) are news articles and the other two (Texts 5-6) are sociological texts from an encyclopedia. The data can be downloaded from here (User: TPRDB, passwd: tprdb). The source text data (project files for Translog-II, tokenized *src files, and texts) can be downloaded from this link. Publications that refer to the data are given in brackets.
Into Arabic
Into Chinese
HNUJml: from-scratch translating (T) and post-editing (P), only texts 1 and 2 (Jia et al., 2019)
MS12: post-editing (P) and editing (E) (many sessions missing)
RUC17: from-scratch translating (T) and post-editing (P), 21 MTI students
RUCMT17: Google MT error annotation by 16 translation students (Carl and Báez, 2019)
STC17: from-scratch translation (T), post-editing (P), sight translation (ST), 16 MTI students
STC17bolt: the STC17 study, re-aligned according to the BOLT guidelines
STML18: sight translation (ST), 9 professional interpreters
STML18bolt: the STML18 study, re-aligned according to the BOLT guidelines
ZHMT19: alignment of 12 different MT systems (See also ARMT19, ESMT19, JAMT19)
Into Danish
KTHJ08: from-scratch translation (T), 12 students and 12 professionals, only news texts 1-3 (Hvelplund 2011)
Into Dutch
ENDU20: Ten native Dutch translators who recently completed a master's degree, only keyboard logging with Translog-II. (Vanroy, 2021)
ENDU20-MT: Two Dutch MT outputs from 2020: DeepL (P20) and Google Translate (P21). (Vanroy, 2021)
Into English
TDA14: copying (C), 16 German translation students (Bangalore et al, 2015).
CS19: copying (C), paraphrasing (H), summarization (U), 13 computer sciences students (Sahoo & Carl 2019)
PARAP: copying (C), paraphrasing (H), computer sciences students
BACK2020: PhD students' back-translation, 9 via Arabic and 4 via Chinese (P08, P11, P13, P15). P03 and P15 are MT. Source texts are substituted, which leads to ST gaze mismatch (Saeedi, 2021).
Into German
SG12: from-scratch translating (T), post-editing (P) and editing (E), 31 translation students (Nitzke 2019, Carl et al 2015)
ADU17: revision (R), 39 translation students (Schaeffer et al 2019)
SJM16: reading (L) and copying (C), 50 German translation students (Schaeffer and Carl, 2017)
Into Hindi
Into Japanese
ENJA15: translation dictation (D), post-editing (P), translation (T), 39 participants (Carl et al, 2016, Ogawa 2021)
JAMT19: alignment of 13 different MT systems (See ARMT19, ESMT19, ZHMT19) (Ogawa et al 2021)
Into Spanish
BML12: from-scratch translating (T), post-editing (P) and editing (E) (Mesa-Lao, 2012, Mesa-Lao, 2013)
BML12_re: manually re-aligned BML12 study following consistent guidelines (Gilbert et al 2022)
there are various versions automatically re-aligned with SimAlign (https://arxiv.org/abs/2004.08728), using 3 different alignment methods:
BML12_SI: itermax (I) alignment method (recommended)
BML12_SA: argmax (A) alignment method
BML12_SM: match (M) alignment method
original BML12 SMT output (Google, from 2012) Sim-aligned: BML12_MT_SI, BML12_MT_SA, BML12_MT_SM
10 different (N)MT systems (from 2021) Sim-aligned: BML12_NTO_SI, BML12_NTO_SA, BML12_NTO_SM
ESMT19: alignment of 9 different MT systems (from 2019, see also ARMT19, JAMT19, ZHMT19) (Gilbert 2021, Ogawa et al 2021)
MPM16: MT error annotation by 8 professional translators (Carl and Báez, 2019)
SPC15: from-scratch translating (T), texts 4 and 6
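The names of the Sim-aligned BML12 variants listed above follow a visible pattern: an optional tag for the data source (`MT` or `NTO`) and a two-letter suffix for the SimAlign method (I, A or M). The following sketch spells that pattern out. It is inferred from the study names on this page only; the helper names are hypothetical, and other studies need not follow the same convention.

```python
# Suffix -> SimAlign alignment method, as listed for the BML12 variants.
ALIGNMENT_METHODS = {
    "SI": "itermax",  # marked as recommended on this page
    "SA": "argmax",
    "SM": "match",
}


def realigned_study_names(base="BML12", variant=None):
    """List the Sim-aligned study names for a base study.

    variant is None for the recorded sessions, or a tag such as
    "MT" (2012 SMT output) or "NTO" (2021 MT systems).
    """
    prefix = base if variant is None else f"{base}_{variant}"
    return [f"{prefix}_{suffix}" for suffix in ALIGNMENT_METHODS]


def alignment_method(study_name):
    """Return the SimAlign method encoded in a study name, or None."""
    suffix = study_name.rsplit("_", 1)[-1]
    return ALIGNMENT_METHODS.get(suffix)
```

For example, `realigned_study_names(variant="NTO")` yields the three NTO study names, and `alignment_method("BML12_re")` returns `None` because the manually re-aligned study carries no method suffix.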
Into Spanish with permuted segments
MP16: non-coherent post-editing, i.e. segments presented in permuted order (Báez, Schaeffer and Carl, 2018)
missionStatements
13 mission statements from different companies in English, each of which contains approximately 160-190 words. The data can be downloaded from here.
HNUJms: English-Chinese; from-scratch translating (T) and post-editing (P), only texts 4, 5, 6 (Jia et al., 2019)
JTD16: English-Japanese; translation dictation (D) (7 participants, translated 12 texts on 6 successive days)
GV18: English-German; from-scratch translating with retrospective protocol (TA) and without (TB), 22 translation students
ENTP19: copying (C), paraphrasing (H), 4 participants, first 6 texts
ministerSpeech
This is a 44:50-minute speech by the Minister for Foreign Affairs of Australia during her visit to Japan in 2014. The speech was given in English at the National Press Club, Tokyo, Japan. The beginning of the speech is segmented into 6 short texts (approx. 1 min each). The source text data (Translog-II project files, videos, wav files, and some background information) can be downloaded from this link.
Into Chinese
IMBi18: 9 professional interpreters, simultaneous interpretation with source text (ST), ear-voice alignment
IMBi18bolt: the IMBi18 study, re-aligned according to the BOLT guidelines
IMBst18: IMB study, simultaneous interpretation with source text (ST), eye-voice alignment
IMBst18bolt: the IMBst18 study, re-aligned according to the BOLT guidelines
XIANG19: from-scratch translation, 10 translation students
Into German
ST19: copying (C), reading aloud (LV), translation (T), sight translation (S), 12 translation students
diverse
Studies that use different source texts:
Brazilian Portuguese, English
JLG10: English <-> Brazilian Portuguese: L1 and L2 translations from/to English and Brazilian Portuguese.
Chinese, English
CET6: Chinese-English: 22 Chinese BA students translating texts into L2 with eye-tracking (Liu, 2018)
JIN15: Data from Jin Huang's PhD dissertation (Huang, 2018)
HNUJd: Data from Yanfang Jia's PhD dissertation (other texts; see HNUJml and HNUJms)
LiTian2019New: English-to-Chinese, translation & copy 4 texts, 64 sessions (Sanjun Sun et al. 2021)
MAecho2019: English-to-Chinese, translation & copy 128 sessions (Sanjun Sun et al. 2021)
Chinese, Portuguese
Danish, English
ACS08: English-Danish: from-scratch translation (T) and copying (C) of non-literal expressions. (Sjørup, 2013)
BD08: English-Danish: from-scratch translation, professional translators. (Dragsted, 2010)
BD13: English-Danish: secondary school students translating and post-editing
LWB09: Danish[L1]-English[L2]: eye tracking experiment with professional translators. (Jensen et al 2009)
English, Dutch (nl)
predict20: English-Dutch: from-scratch translation (T) and copying (C) of non-literal expressions. (Vanroy, 2021)
Estonian, English
HLR13: English-Estonian: 5 participants translating 3 different texts.
Polish, French
DG01: Polish-French: professional and non-professional translators with and without a representation of the text (Płońska, 2015)
German, English
AE17: English-German from-scratch translating (T), 12 students, 20 texts (Heilmann, )
AU20: English-German from-scratch translating (T) of popular-scientific news texts in a translation lab, 11 professional translators (Heilmann et al. 2022)
HE17: from-scratch translating (T) of popular-scientific news texts in a translation lab, 12 professional translators (Freiwald et al. 2020)
CPH17: English-German from-scratch translating (T), 41 students
Spanish, English
Trados Data
Sessions recorded with Trados. Follow the instructions here to download the data. For a description of the Trados logging tool and conversion into TPR-DB, see Zou, L., et al. (2023), Yamada et al. (2022), or Zou, L., & Carl, M. (2022).
Data used in Vieira et al. (2023), Translating science fiction in a CAT tool: machine translation and segmentation settings. Translation & Interpreting, Vol. 15, No. 1:
CREATIVE: English - Chinese Literature translation vs. postediting (13 professional translators)
CREATIVE2: English - Chinese Literature sentence postediting vs. paragraph postediting (11 professional translators)
Data used in Zou et al (2022):
ATJA22: English-Japanese different translation briefs / expert translators (7 experienced translators)
ATZH22: English-Chinese different translation briefs / novice translators (5 novice translators)
Data used in Gilbert (2022), PhD thesis:
DG21: English-Spanish post-editing, 36 professional translators (subset of LS14), with automatically and manually highlighted MT errors (P, PA and PM, respectively)
DG21error: manual error annotation of the post-edited versions
Studies conducted with the CASMACAT workbench include the following. Follow the instructions here to download the data:
ALG14: This study compares professional translators and bilinguals while post-editing with the third prototype of the CASMACAT workbench featuring visualization of word alignments. (Alabau et al, 2014)
CEMPT13: This study contains post-editing recordings with the second prototype of the CASMACAT workbench, featuring interactive machine translation. (Alves et al, 2015)
CFT12: This study contains data of the first CASMACAT field trial from June 2012, comparing post-editing with from-scratch translation. (Elming et al, 2014)
CFT13: This study contains data of the second CASMACAT field trial from June 2013, comparing post-editing and interactive machine translation. (Carl et al. 2014)
CFT14: This study contains data of the third CASMACAT field trial from June 2014, comparing interactive machine translation and online learning. (Alabau et al, 2014, Zapata 2015)
EFT14: The study compares active and online learning during interactive translation prediction (Ortiz-Martínez et al 2015)
JN13: This study is recorded with the second prototype of the CASMACAT workbench featuring interactive machine translation and word alignments. (Nitzke and Oster, 2015)
LS14: This study investigates learning effects with interactive post-editing over a period of six weeks with the third prototype of the CASMACAT workbench. (Alabau et al, 2014, Alabau et al 2015)
PFT13: This study is a pre-field trial test prior to the second CASMACAT field trial.
PFT14: This study is a pre-field trial test prior to the third CASMACAT field trial.
ROBOT14: This study investigates usage of external resources during translation and post-editing (Daems et al, 2015)
Data from the CASMACAT field trial 2014 (CFT14):
Seven post-editors each post-editing two texts in plain post-editing mode and under active learning conditions. Revisions of the post-edited texts with handwriting recognition.
Data from the CASMACAT longitudinal study 2014 (LS14):
Five post-editors each post-editing 24 files during a period of 6 weeks between May and June 2014 in plain post-editing and interactive mode. In all, 120 translation sessions, of which 35 have recorded gaze data.
Data from the CASMACAT field trial 2013 (CFT13):
Logging data of 81 post-editing and revision sessions, more than 120 hours of user activity data, recorded with CASMACAT workbench v.2.0:
CFT13 segments (v 1.4) - segment editing and revision information
For a description of the features see: Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao (2014) "CFT13: A new resource for research into the post-editing process", In Proceedings of LREC 2014
CFT13 segments (v 1.3) - segment editing and revision information
CFT13 TPR-DB (v1.2) - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
more recent versions are contained in the TPR-DB
CFT13 videos - videos available upon request (mc.ibc@cbs.dk)
Data from the CASMACAT pre-field trial 2013 (PFT13):
This data was recorded with CASMACAT workbench v.2.0:
PFT13 - compiled into TPR-DB format
Data from the CASMACAT field trial 2012 (CFT12):
Logging data of 89 translation sessions English -> Spanish recorded with the CASMACAT workbench v.1.0
CFT12 - raw logging data
CFT12 - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
CFT12 - link to the enriched data as prepared by the group in Wolverhampton.