CRITT TPR-DB

The CRITT Translation Process Research Database (TPR-DB) is a publicly available database of recorded translation sessions for Translation Process Research (TPR). It contains user activity data (UAD) of translators' behavior collected in approximately 30 translation (and text production) studies with Translog-II and with the CASMACAT workbench. This data acquisition software logs keystrokes and gaze data during text perception and text production. The data currently amounts to more than 500 hours of text production gathered in more than 1400 sessions. In addition to the raw logging data, a post-processed database (TPR-DB) is made available, which compiles this data into a set of tab-separated tables that can be processed more easily by various visualization and analysis tools.


TPR-DB Versions

The TPR-DB consists of several studies that were conducted either with the CASMACAT workbench or with Translog. Raw logging and aligned data for all sessions are available on SourceForge (https://sourceforge.net/projects/tprdb/) and can be checked out via svn (approx. 50 GB!).

On Linux (or cygwin):
- svn checkout --username=tprdb https://svn.code.sf.net/p/tprdb/svn/ tprdb-svn

On Windows:
- Install TortoiseSVN, e.g. from https://tortoisesvn.net/downloads.html
- Right-click in the directory where you want to install the TPR-DB, select "SVN checkout", paste https://svn.code.sf.net/p/tprdb/svn/ into the URL field and click OK.
- You can also download individual folders by giving their URL, e.g.
 https://svn.code.sf.net/p/tprdb/svn/bin

The post-processed and compiled version of the data consists of several summary files per study. All summary tables are zipped and can be downloaded from https://sourceforge.net/projects/tprdb/files/
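Once a zipped set of summary tables has been downloaded, the tab-separated files can be read with standard tools. The following is a minimal Python sketch, not part of the TPR-DB tooling: the archive member name, the column names (Id, Token, Dur) and the UTF-8 encoding are assumptions made for illustration, and a small in-memory archive stands in for a real download.

```python
import csv
import io
import zipfile


def read_tsv_from_zip(zip_source, member):
    """Return the rows of one tab-separated table inside a zip archive as dicts."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Assumption: the tables are UTF-8 encoded and start with a header row.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(reader)


def build_demo_zip():
    """Build a tiny in-memory archive so the sketch runs without a download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Hypothetical member name and columns, for illustration only.
        archive.writestr(
            "demo_study/session1.tsv",
            "Id\tToken\tDur\n1\tHello\t120\n2\tworld\t95\n",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    rows = read_tsv_from_zip(build_demo_zip(), "demo_study/session1.tsv")
    total_duration = sum(int(row["Dur"]) for row in rows)
    print(len(rows), total_duration)
```

To use it on real data, pass the path of a downloaded zip file and the name of one of its members instead of the demo archive; `zipfile.ZipFile(path).namelist()` lists the members that are actually present.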

TPR-DB Studies

The study names link to the raw data; primary references that make use of each resource are given in parentheses.
Studies that were conducted with Translog:
  • ACS08: This study explores the way in which translators process the meaning of non-literal expressions by investigating the gaze times associated with these expressions.  (Sjørup, 2013)  
  • BD08: This study involves Danish professional translators working from English into Danish. (Dragsted, 2010)
  • BD13: This study involves secondary school students translating and post-editing from English into Danish.
  • DG01: The study compares students, professional and non-professional translators with and without a representation of the text (Płońska, 2015)
  • GS12: This study contains post-editing data of four pieces of news from Spanish into English.
  • HLR13: This is a translation study from English into Estonian (5 participants translating 3 different texts).
  • JIN15: Data of Jin Huang's PhD thesis
  • JLG10: This study investigates L1 and L2 translations from/to English and Brazilian Portuguese.
  • JTD16: This study investigates learning effects in Japanese translation dictation (7 participants translated 12 texts on 6 successive days).
  • LWB09: This study reports on an eye tracking experiment in which professional translators were asked to translate two texts from L1 Danish into L2 English. (Jensen et al 2009)
  • MS13: This study investigates translators' behaviour when translating and post-editing Portuguese and Chinese in both language directions.
  • RH12: This is an authoring study for the production of news by two Spanish journalists.
  • ZHPT12: This study investigates translators' behaviour when translating journalistic texts. The specific aim is to explore the translation process when processing non-literal (metaphoric) expressions.
The aim of the multiLing experiment is to compare from-scratch translation (T), post-editing (P), monolingual post-editing (E), and translation dictation (D) for different translators and for different languages. The six English source texts are translated by student and experienced translators; four texts (texts 1-4) are news, and two texts (5-6) are sociological texts from an encyclopedia. Texts are permuted in a systematic manner so as to make sure that each text was translated by every translator and every translator translated two different texts in each translation mode. 
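The permutation described above can be illustrated with a simple rotation scheme. This is a minimal sketch, not the procedure actually used in the multiLing studies (the exact scheme is not specified here): the function name `assign`, the restriction to three modes (T, P, E) and the rotation by translator index are assumptions made for illustration.

```python
from collections import Counter

TEXTS = [1, 2, 3, 4, 5, 6]
MODES = ["T", "P", "E"]  # from-scratch translation, post-editing, monolingual editing


def assign(translator_index, texts=TEXTS, modes=MODES):
    """Return (text, mode) pairs for one translator.

    The text list is rotated by the translator index and then cut into
    equally sized blocks, one block per mode. Every translator therefore
    sees each text exactly once and works on two texts in each mode.
    """
    per_mode = len(texts) // len(modes)
    shift = translator_index % len(texts)
    rotated = texts[shift:] + texts[:shift]
    return [(text, modes[pos // per_mode]) for pos, text in enumerate(rotated)]


if __name__ == "__main__":
    for translator in range(6):
        plan = assign(translator)
        per_mode_counts = Counter(mode for _, mode in plan)
        print(translator, plan, dict(per_mode_counts))
```

With six translators, the rotation also balances the design across texts: each text is processed twice in each of the three modes.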

multiLing dataset:
  • BML12: This study contains translating, post-editing and editing data of six texts from English into Spanish. (Mesa-Lao, 2012, Mesa-Lao, 2013)
  • ENJA15: The goal of this study was to compare English-to-Japanese translation dictation, post-editing and from-scratch translation. (Carl et al, 2016)
  • KTHJ08: This study contains only translation data for the news texts 1-3. (Hvelplund 2011)
  • MS12: This study contains translating, post-editing and editing data of the six texts from English into Chinese.
  • NJ12: This study contains translating, post-editing and editing data of the six texts from English into Hindi by professional translators.
  • SG12: This study contains translating, post-editing and editing data of the six texts from English into German. (Carl et al 2015)
  • RUC17: This study contains translating, post-editing and editing data of the six texts from English into Chinese by 22 translation students.
The following two studies are based on the same six English source texts but are not part of the multiLing dataset:
  • TDA14: In this study participants were asked to copy the six English texts. (Bangalore et al, 2015)
  • WARDHA13: This study contains translating, post-editing and editing data of the six texts from English into Hindi by students.
Studies conducted with the CASMACAT workbenches include: 
  • ALG14: This study compares professional translators and bilinguals while post-editing with the third prototype of the CASMACAT workbench featuring visualization of word alignments. (Alabau et al, 2014)
  • CEMPT13: This study contains post-editing recordings with the second prototype of the CASMACAT workbench, featuring interactive machine translation. (Alves et al, 2015)
  • CFT12: This study contains data of the first CASMACAT field trial from June 2012, comparing post-editing with from-scratch translation. (Elming et al, 2014)
  • CFT13: This study contains data of the second CASMACAT field trial from June 2013, comparing post-editing and interactive machine translation. (Carl et al. 2014)
  • CFT14: This study contains data of the third CASMACAT field trial from June 2014, comparing interactive machine translation and online learning. (Alabau et al, 2014, Zapata 2015)
  • EFT14: The study compares active and online learning during interactive translation prediction (Ortiz-Martínez et al 2015)
  • JN13: This study is recorded with the second prototype of the CASMACAT workbench featuring interactive machine translation and word alignments. (Nitzke and Oster, 2015)
  • LS14: This study investigates learning effects with interactive post-editing over a period of six weeks with the third prototype of the CASMACAT workbench. (Alabau et al, 2014, Alabau et al 2015)
  • PFT13: This study is a pre-field trial test prior to the second CASMACAT field trial.
  • PFT14: This study is a pre-field trial test prior to the third CASMACAT field trial.
  • ROBOT14: This study investigates usage of external resources during translation and post-editing (Daems et al, 2015)

CASMACAT UAD data

Data from the CASMACAT field trial 2014 (CFT14):

Seven post-editors each post-edited two texts in plain post-editing mode and under active learning conditions. The post-edited texts were then revised using handwriting recognition.

Data from the CASMACAT longitudinal study 2014 (LS14):

Five post-editors each post-edited 24 files over a period of 6 weeks between May and June 2014, in plain post-editing and interactive mode. In all, there are 120 translation sessions, of which 35 include recorded gaze data.

Data from the CASMACAT field trial 2013 (CFT13):

Logging data of 81 post-editing and revision sessions, more than 120 hours of user activity data, recorded with CASMACAT workbench v.2.0:

  • CFT13 raw logging data
  • CFT13 segments (v 1.4) - segment editing and revision information
    For a description of the features see: Michael Carl, Mercedes Martínez García and Bartolomé Mesa-Lao (2014) "CFT13: A new resource for research into the post-editing process", In Proceedings of LREC 2014
  • CFT13 segments (v 1.3) - segment editing and revision information 
  • CFT13 TPR-DB (v1.2) - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
    More recent versions are contained in the TPR-DB.
  • CFT13 videos - videos available upon request (mc.ibc@cbs.dk)

Data from the CASMACAT pre-field trial 2013 (PFT13):

This data was recorded with CASMACAT workbench v.2.0.

Data from the CASMACAT field trial 2012 (CFT12):

Logging data of 89 translation sessions (English -> Spanish) recorded with the CASMACAT workbench v.1.0:

  • CFT12 - raw logging data
  • CFT12 - compiled into TPR-DB format (available upon request: mc.ibc@cbs.dk)
  • CFT12 - link to the enriched data as prepared by the group in Wolverhampton.

TPR-DB Documentation

For documentation on how to extract and convert the raw logging data into the TPR-DB format, read this document. The document describes how to run the scripts in the "bin" folder in the TPR study folder. The database compilation process also requires external tools and resources:

  • YAWAT: Yet Another Word Alignment Tool is a browser-based tool for manual word alignment. The YAWAT website visualizes the segment and word alignments of the entire TPR-DB and requires a password which can be obtained from mc.ibc@cbs.dk. The following paper explains what Yawat is all about, and how to use it.
  • JDTAG is a Java-based tool for manual word alignment and alignment correction, similar in function to YAWAT. JDTAG can read the atag file format used in the Alignment folder of the TPR-DB (version 1.0) and in the raw data (please send an e-mail to mc.ibc@cbs.dk if you want access to JDTAG).


Overview over contents of studies in the TPR-DB (v2.265)

Study     Number Sessions  Source Language  Target Language  Task  Number Participants  Source Tokens  Target Tokens
ACS08     30               en               da               T     17                   5085           5075
ACS08     30               en               en               C     17                   5099           5109
ALG14     8                en               es               P     8                    4460           4807
ALG14     8                en               es               PA    8                    4460           4801
BD08      10               en               da               T     10                   1100           1056
BD13      8                en               da               T     8                    786            751
BD13      10               en               da               P     10                   970            1014
BML12     64               en               es               P     32                   9012           10216
BML12     63               en               es               T     32                   8936           10102
BML12     60               en               es               E     30                   8468           9594
CEMPT13   20               en               pt               PIA   20                   6706           6840
CEMPT13   20               en               pt               P     20                   6494           6585
CFT13     27               en               es               R     4                    26919          28738
CFT13     27               en               es               PI    9                    31752          33871
CFT13     27               en               es               P     9                    31294          33770
CFT13     27               en               es               PIA   9                    31838          34047
CFT14     7                en               es               RE    3                    20341          22015
CFT14     7                en               es               R     4                    20273          22251
CFT14     7                en               es               P     7                    20273          22067
CFT14     7                en               es               PIO   7                    20341          22284
DG01      60               fr               pl               T     60                   25380          20329
EFT14     11               en               es               PIVO  11                   12437          13549
EFT14     11               en               es               PI    11                   12437          13696
EFT14     10               en               es               PIVA  10                   11327          12472
GS12      8                es               en               P     4                    2482           2383
HLR13     15               en               et               T     5                    1535           1186
JIN15     18               en               zh               S     18                   1947           1728
JIN15     18               en               zh               P     18                   1998           1845
JIN15     17               en               zh               R     17                   1946           1833
JLG10     10               en               pt               T     5                    2577           2781
JLG10     10               pt               en               T     5                    2611           2621
JN13      4                en               de               PIA   4                    2590           2668
JN13      4                en               de               P     4                    2590           2571
KTHJ08    69               en               da               T     24                   10571          10667
LS14      60               en               es               PI    5                    72109          80278
LS14      60               en               es               P     5                    72126          80454
LWB09     40               da               en               T     18                   5652           6206
MS12      19               en               zh               P     11                   2708           2562
MS12      15               en               zh               T     10                   2061           1916
MS12      10               en               zh               E     8                    1295           1203
MS13      16               zh               pt               P     16                   1410           1648
MS13      16               pt               zh               T     16                   1386           1378
MS13      22               zh               pt               T     22                   1938           2216
MS13      18               pt               zh               P     18                   1555           1507
NJ12      39               en               hi               T     20                   5505           5784
NJ12      61               en               hi               P     20                   8581           9365
PFT13     9                en               es               P     9                    3035           3144
PFT13     19               en               es               PI    19                   6689           7437
PFT13     16               en               es               PIC   16                   5396           5147
PFT13     15               en               es               PIO   15                   4611           4666
PFT13     16               en               es               PIL   16                   5572           5344
PFT14     3                en               es               PIVO  3                    3245           3150
PFT14     2                en               es               PIVA  2                    2286           2184
PFT14     2                en               es               PIV   2                    2161           2077
RH12      2                es               es               A     2                    1207           1207
ROBOT14   40               en               nl               P     10                   7375           7527
ROBOT14   40               en               nl               T     10                   7375           7329
SG12      46               en               de               E     23                   6522           6741
SG12      45               en               de               P     23                   6352           6470
SG12      47               en               de               T     24                   6632           6777
TDA14     48               en               en               C     8                    6792           6779
WARDHA13  34               en               hi               T     18                   4832           4790
WARDHA13  31               hi               hi               C     18                   4365           4104
WARDHA13  27               en               hi               P     15                   3780           4016
ZHPT12    12               zh               pt               T     12                   1104           1603
Total     1562             7                9                15    874                  620210         657948
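Rows of the overview can be aggregated per study with a few lines of Python. The sketch below is only an illustration: the rows are copied from the BML12 and SG12 entries of the overview, while the whitespace-separated layout and the function name `totals_per_study` are assumptions rather than part of the TPR-DB tooling.

```python
from collections import defaultdict

# Fields per row: study, sessions, source language, target language, task,
# participants, source tokens, target tokens (values copied from the overview).
ROWS = """\
BML12 64 en es P 32 9012 10216
BML12 63 en es T 32 8936 10102
BML12 60 en es E 30 8468 9594
SG12 46 en de E 23 6522 6741
SG12 45 en de P 23 6352 6470
SG12 47 en de T 24 6632 6777
"""


def totals_per_study(text):
    """Sum sessions and source tokens over all task rows of each study."""
    totals = defaultdict(lambda: {"sessions": 0, "source_tokens": 0})
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        study = fields[0]
        totals[study]["sessions"] += int(fields[1])
        totals[study]["source_tokens"] += int(fields[6])
    return dict(totals)


if __name__ == "__main__":
    for study, values in totals_per_study(ROWS).items():
        print(study, values["sessions"], values["source_tokens"])
```

The participants column is deliberately not summed per study, since the same participants may appear under several tasks of one study.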


Visualizing the TPR-DB 


Generating the TPR-DB 


Google Group: TPR2011

TPR2011: This group was created at the CRITT TPR summer school 2011. The idea is to discuss Translation Process Research related themes, e.g. theoretical, methodological and experimental issues, data visualization and human translation process modeling, qualitative and quantitative data analysis, etc.


License

Creative Commons License

The CRITT Translation Process Research Database (TPR-DB) by the Center for Research and Innovation in Translation and Translation Technology is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

We would like to thank all contributors and participants for their work.
