Uploading Data

Prepare TPR-DB session files

TPR-DB naming conventions

Adding Language tag

Upload and process Studies

Upload a new study

Add data to available collections

Uploading Trados data to the TPR-DB

Uploading PET data to the TPR-DB

Speech data and the CRITT TPRDB

Generate a TPR-DB with perl scripts

Preparing Windows

Tokenize and align texts

Generate tables

The TPR-DB management tool is a browser interface through which one can upload and download raw logging data and generate TPR-DB tables. Data can be downloaded from the public section. A 'private' account is required for uploading and processing data in the CRITT TPR-DB. Interested researchers can request a CRITT membership.

TPR-DB tables facilitate the analysis of the logging data through a large set of pre-defined features. This management tool allows you to:

- Upload translation data gathered with Translog-II (Translation study)
- Align and error-annotate translation sessions
- Generate and download TPR-DB tables

To upload your study, it is crucial that you prepare your files according to the TPR-DB conventions. Follow the steps below and post your technical, methodological, and theoretical questions and comments here if something is unclear.

Prepare TPR-DB session files

TPR-DB naming conventions

The TPR-DB is an anonymized repository of logged translation sessions. It covers a number of different translation modes, such as post-editing of machine translation and interactive post-editing (e.g. CASMACAT project), in addition to human translation.

The logged user activity data (UAD) of a session is contained in one single file, whose file name encodes important variables of the translation study. For example:

- P01_T1.xml: participant #1 translated text 1
- P01_P2.xml: participant #1 post-edited text 2
- P15_C3.xml: participant #15 copied text 3

That is, the file name encodes Participant, Task, and Text in the following format:

- Participants are numbered P01 to Pn
- Task codes:
  - T: from-scratch Translation
  - MT: Machine Translation output
  - P: Post-Editing of Machine Translation
  - E: Monolingual Editing (Post-Editing without Source Text)
  - C: Text Copying
  - R: Translation Revision
  - D: Dictation (with realtime ASR; e.g. Dragon Speech)
  - S: Spoken Translation production (e.g. transcribed sight translation)
  - I: Simultaneous Interpreting
  - ST: Simultaneous interpretation with source text
  - A: Authoring (monolingual text production)
  - L: Reading
  - LV: Reading aloud (for German vorlesen)
  - U: Monolingual summarization
  - H: Monolingual paraphrasing
- Texts are numbered 1 to m
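
The naming convention above can be checked before uploading. The following is a minimal sketch in Python, not part of the TPR-DB tooling; the function name parse_session_name and the exact regular expression are illustrative assumptions derived from the examples and task codes listed above.

```python
import re

# Task codes from the list above; two-letter codes come first so that
# "ST", "MT" and "LV" are not read as "S", "M..." or "L" plus leftovers.
TASK_CODES = ("MT", "ST", "LV", "T", "P", "E", "C", "R", "D", "S", "I", "A", "L", "U", "H")

# Assumed pattern: P<participant digits>_<task code><text number>.xml
SESSION_NAME = re.compile(
    r"^(?P<participant>P\d+)_(?P<task>" + "|".join(TASK_CODES) + r")(?P<text>\d+)\.xml$"
)


def parse_session_name(filename):
    """Return (participant, task, text number), or None if the name does not fit."""
    match = SESSION_NAME.match(filename)
    if match is None:
        return None
    return match["participant"], match["task"], int(match["text"])


print(parse_session_name("P15_C3.xml"))  # ('P15', 'C', 3)
print(parse_session_name("notes.xml"))   # None
```

A file for which the function returns None most likely needs to be renamed before it is uploaded.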

Adding Language tag

Before uploading a Translog-II log file through the TPR-DB management tool, you must insert a language tag, so that the server knows the source and target languages and can process the files accordingly. Open the xml file and insert a line <Languages … /> with the source and target languages of the session, at the position indicated below. IMPORTANT: Make sure to use the correct quotes, and no additional spaces between attributes and values!

...

<fullScreen>false</fullScreen>

<lockWindows>false</lockWindows>

<Key_Logger />

...
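
Whether a log file already contains a language tag can be checked with a few lines of Python. This is only a sketch: it looks for an element named Languages anywhere in the file and returns its attributes, without assuming which attributes the server expects. The function name find_language_tag is hypothetical, and the attribute names in the usage comment are placeholders, not the documented TPR-DB attributes.

```python
import xml.etree.ElementTree as ET


def find_language_tag(source):
    """Return the attributes of the first <Languages .../> element, or None.

    `source` is a file name or a binary file object, as accepted by ET.parse.
    """
    root = ET.parse(source).getroot()
    element = next(root.iter("Languages"), None)
    if element is None:
        return None
    return dict(element.attrib)


# Example call (the file name is a placeholder):
# attributes = find_language_tag("P01_T1.xml")
# if attributes is None:
#     print("No <Languages ... /> line found; insert it before uploading.")
```

Note that a file with wrong quotes around an attribute value is not well-formed xml, so ET.parse raises a ParseError for it; this catches one of the mistakes the warning above refers to.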

Upload and process Studies

You need an account to generate a TPR-DB from the management tool with your own study. Contact m.gummiball[at]gmail.com to obtain an account that allows you to upload your studies and to change the word alignments. You can also generate a TPR-DB from logging data using perl scripts.

Log in to the TPR-DB management tool

- Open Mozilla Firefox (avoid Chrome or Internet Explorer)
- Log in to YAWAT: https://critt.as.kent.edu/cgi-bin/yawat/yawat.cgi
- Provide login name and password
- Change the address in the address bar manually to: https://critt.as.kent.edu/cgi-bin/yawat/tpd.cgi
- You can upload and download studies and generate/download TPR-DB tables there

The appearance of the TPR-DB management tool

Upload a new study

1. Make sure that
  - your Translog-II log files follow the naming convention (e.g. P01_T5.xml, see "TPR-DB naming conventions" above).
  - your log files contain correct source and target language tags (see "Adding Language tag" above).
2. Zip your Translog-II xml log files, together with the alignment files (*src, *tgt, *atag) if available. Do not use any other compression tool.
3. On the management tool, click on the "Browse" button and locate the zip file.
4. Provide a study name (e.g. bella in the screenshot below), which should consist of letters and/or digits. Do not use a space in the name.
5. Press the upload button. This process will
  - extract the *xml files and store them in a Translog-II folder
  - tokenize the source and the target texts
  - segment the texts into sentences
  - align the segments (sentence by sentence) in the log files, if alignment files are not provided
  - upload the aligned sentences to the YAWAT tool
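
The zip file and the study name from the steps above can also be prepared with a script. The following Python sketch uses only the standard library; the function names are hypothetical, and storing the files flat (without sub-folders) inside the archive is an assumption, not something the management tool is documented to require.

```python
import re
import zipfile
from pathlib import Path

# Log files plus the optional alignment files mentioned above.
UPLOAD_SUFFIXES = {".xml", ".src", ".tgt", ".atag"}


def is_valid_study_name(name):
    """Study names should consist of letters and/or digits only (no spaces)."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None


def build_upload_zip(session_dir, zip_path):
    """Zip all log and alignment files found directly in session_dir.

    Returns the list of file names written to the archive.
    Assumption: files are stored flat, without their folder path.
    """
    files = sorted(
        path for path in Path(session_dir).iterdir()
        if path.is_file() and path.suffix in UPLOAD_SUFFIXES
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
    return [path.name for path in files]


# Example call (folder and file names are placeholders):
# if is_valid_study_name("bella"):
#     build_upload_zip("my_sessions", "bella_upload.zip")
```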

The appearance of the screen after successful upload

Add data to available collections

When collecting and adding data to available ST data collections (these are currently "multiLing", "missionStatement", "ministerSpeech"), the source text files must be identical with respect to text numbering, segmentation and tokenization. Perhaps the best way to ensure this is to use the available Translog-II *project files, which can be downloaded from this link. Once the Translog-II data is collected and uploaded to the TPRDB as described above, the *src files (in the Alignment folder) should be replaced by the corresponding *src files of previous studies, which can also be downloaded from this same link. For ST-TT alignment and all further automatic processing, make sure that the src files in your new study are identical with the src files in the already available studies in that collection.

Uploading Trados data to the TPR-DB

Keylogging data collected with Trados Studio (i.e. with the Qualitivity plugin) can be uploaded to the CRITT TPR-DB. The uploading option can also synchronize the keylogging data with data from various eye-trackers (Tobii, Eyelink and Gazepoint) recorded during the translation sessions. More details are provided here, and video instructions are available on YouTube.

Uploading PET data to the TPR-DB

PET is a tool to facilitate the post-editing of translations from MT systems. PET produces *per files, which contain the keylogging data collected in PET sessions. These *per files can be uploaded to the TPR-DB:

1. Zip all PET *per files that are part of the same session. The file root of the zipped *per files should follow the TPR-DB naming conventions. The zip file can have any name.
2. Give a study name (preferably only alphanumeric characters). The name in the picture below is "GRO8".
3. Under "Task Name", tick PET to tell the tool that these are PET files.
4. The "SI" in "Other Task" produces the study "GRO_SI" with automatic word alignment.

The PET-to-Translog conversion produces a separate Translog-II session for each PET unit (i.e. the PET term for segment). That is, if the PET file name is P33_T1.per and the file has 21 units (segments), there will be 21 Translog sessions after the conversion: P33_T1001 ... P33_T1021. The PET-to-Translog conversion assumes that all *per files that end in the same number (e.g., 1 in P33_T1) have an identical number of units and identical ST content. TPR-DB computes word translation entropy on this basis.
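
The session names that result from the conversion can be listed as follows. This is a small Python sketch, not the conversion script itself; the three-digit, zero-padded unit suffix is inferred from the P33_T1001 ... P33_T1021 example above, and the function name is hypothetical.

```python
from pathlib import Path


def pet_unit_sessions(per_filename, unit_count):
    """List the Translog session names expected for one PET *per file.

    Assumption: each unit number is appended to the file root as a
    three-digit, zero-padded suffix, as in the example above.
    """
    root = Path(per_filename).stem  # "P33_T1.per" -> "P33_T1"
    return [f"{root}{unit:03d}" for unit in range(1, unit_count + 1)]


sessions = pet_unit_sessions("P33_T1.per", 21)
print(sessions[0], sessions[-1], len(sessions))  # P33_T1001 P33_T1021 21
```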

Speech data and the CRITT TPRDB

The speech signal should be transcribed in such a way that each word comes with a time-stamp indicating its production time. You can use automatic transcription (ASR), such as IBM Watson (see tutorial here and here) or Speechmatics (see documentation and some scripts), which in our experience produces better output and punctuation marks. The automatic transcription is then revised, e.g. with ELAN or with a spreadsheet, converted into a Translog-II compatible xml file and uploaded to the CRITT TPR-DB.

Depending on the type of spoken data (interpretation, sight translation, reading aloud, etc.), the voice data can also be synchronized with the gaze data (e.g. during reading or sight translation) and/or with the audio input (e.g. for interpreting, simultaneous interpreting with text).

Generate a TPR-DB with perl scripts

You can also generate TPR-DB tables locally on your computer:

Preparing Windows

1. Install Jdtag (an offline manual word alignment tool): https://drive.google.com/file/d/1BaBuN_hOdp6C1SeVWyoU6CFvBqhtjVGg/view?usp=sharing
  - or alternatively use YAWAT and the management tool for online word alignment
  - or use an automatic word alignment tool
2. Install Perl (needed for running the analysis scripts): http://strawberryperl.com/ or (better?) install Cygwin with the perl package
3. Download and unzip the latest TPR-DB version from here.
  Make a "tprdb" directory and place in it at least the bin folder with the perl scripts (i.e. tprdb/bin/)
4. Create a "<study>/Translog-II" folder inside the tprdb directory, i.e.: tprdb/<study>/Translog-II. Place all the Translog-II xml log files in the Translog-II folder with naming conventions given above and insert the language tag.
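
The folder layout from steps 3 and 4 can be created and checked with a short script. This Python sketch only mirrors the paths named above (tprdb/bin and tprdb/<study>/Translog-II); the function names are hypothetical and nothing here is part of the TPR-DB distribution.

```python
from pathlib import Path


def prepare_study_folder(tprdb_root, study):
    """Create tprdb/<study>/Translog-II (if missing) and return its path."""
    translog_dir = Path(tprdb_root) / study / "Translog-II"
    translog_dir.mkdir(parents=True, exist_ok=True)
    return translog_dir


def has_bin_folder(tprdb_root):
    """Check that the bin folder with the perl scripts is in place."""
    return (Path(tprdb_root) / "bin").is_dir()


# Example call (directory and study names are placeholders):
# target = prepare_study_folder("tprdb", "mystudy")
# print("Copy the Translog-II xml log files into", target)
# if not has_bin_folder("tprdb"):
#     print("tprdb/bin is missing; unzip the TPR-DB scripts first.")
```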

Tokenize and align texts

In Cygwin: cd into your tprdb/bin folder and run:

./StudyAnalysis.pl -C tokenize -S <study> -U <user>

This will produce three files inside the folder tprdb/<study>/Alignment with suffixes *.src, *.tgt and *.atag (a special tokenizer must be installed for Japanese and Chinese).

Open the Jdtag program and load the *.src, *.tgt and *.atag files. Manually align the words in the source and target texts and save the result in <study>/Alignment/*.atag

Or use one of the other options for word alignment

Generate tables

In Cygwin: cd into your tprdb/bin folder and run:

./StudyAnalysis.pl -C tables -S <study> -U <user>

This will produce two folders <user>/<study>/Events and <study>/Tables. The folder <study>/Tables contains a number of files which are helpful for further data analysis.
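
If the two StudyAnalysis.pl calls are to be scripted, the command lines can be assembled as argument lists, which avoids typing mistakes in the option dashes. The sketch below is in Python and uses only the options shown above (-C, -S, -U); the function name is hypothetical, and restricting the step to "tokenize" and "tables" reflects only the two calls described in this section.

```python
def study_analysis_command(step, study, user):
    """Build the argument list for one StudyAnalysis.pl call.

    Only the two steps described in this section are accepted.
    """
    if step not in ("tokenize", "tables"):
        raise ValueError(f"unknown step: {step!r}")
    return ["./StudyAnalysis.pl", "-C", step, "-S", study, "-U", user]


print(" ".join(study_analysis_command("tables", "mystudy", "myuser")))
# ./StudyAnalysis.pl -C tables -S mystudy -U myuser
```

The resulting list can be passed to subprocess.run with cwd set to the tprdb/bin folder. The word alignment step between "tokenize" and "tables" remains manual unless an automatic aligner is used.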