Uploading Data
The TPR-DB management tool is a browser interface through which one can upload and download raw logging data and generate TPR-DB tables. Data can be downloaded from the public section. A 'private' account is required for uploading and processing data in the CRITT TPR-DB. Interested researchers can request a CRITT membership.
TPR-DB tables facilitate the analysis of the logging data through a large set of pre-defined features. This management tool allows you to:
Upload translation data gathered with Translog-II (Translation study)
Align and error-annotate translation sessions
Generate and download TPR-DB tables
To upload your study, it is crucial that you prepare your files according to the TPR-DB conventions. Follow the steps below and post your technical, methodological, and theoretical questions and comments here if something is unclear.
Prepare TPR-DB session files
TPR-DB naming conventions
The TPR-DB is an anonymized repository of logged translation sessions and contains a number of different modes, such as post-editing of machine translation and interactive post-editing (e.g. CASMACAT project), in addition to human translations.
The logged user activity data (UAD) of a session is contained in a single file, whose name encodes important variables of the translation study. For example:
P01_T1.xml: participant #1 translated text 1
P01_P2.xml: participant #1 post-edited text 2
P15_C3.xml: participant #15 copied text 3
That is, Participant, Task, Text, in the following format:
Participants are numbered P01 to Pn
Task codes:
T: from scratch Translation
MT: machine translation output
P: Post-Editing of Machine Translation
E: Monolingual Editing (Post-Editing without Source Text)
C: Text Copying
R: Translation Revision
D: Dictation (with real-time ASR; e.g. Dragon Speech)
S: Spoken Translation production (e.g. transcribed sight translation)
I: Simultaneous Interpreting
ST: simultaneous interpretation with source text
A: Authoring (monolingual text production)
L: Reading
LV: reading aloud (for German vorlesen)
U: monolingual summarization
H: monolingual paraphrasing
Texts are numbered 1 to m
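The naming scheme above can be checked before upload with a small script. The following is only a minimal sketch, assuming Python is available locally; the function name parse_session_name and the SessionName type are hypothetical helpers and not part of the TPR-DB tools. The task codes are copied from the list above.

```python
import re
from typing import NamedTuple

# Task codes copied from the list above (code -> short description).
TASK_CODES = {
    "T": "from-scratch translation",
    "MT": "machine translation output",
    "P": "post-editing of machine translation",
    "E": "monolingual editing",
    "C": "text copying",
    "R": "translation revision",
    "D": "dictation",
    "S": "spoken translation production",
    "I": "simultaneous interpreting",
    "ST": "simultaneous interpretation with source text",
    "A": "authoring",
    "L": "reading",
    "LV": "reading aloud",
    "U": "monolingual summarization",
    "H": "monolingual paraphrasing",
}


class SessionName(NamedTuple):
    participant: int
    task: str
    text: int


# P<participant>_<task code><text number>.xml, e.g. P01_T1.xml
_PATTERN = re.compile(r"P(\d+)_([A-Z]+)(\d+)\.xml")


def parse_session_name(filename: str) -> SessionName:
    """Split a session file name into participant, task code and text number."""
    match = _PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a TPR-DB style session name: {filename!r}")
    participant, task, text = match.groups()
    if task not in TASK_CODES:
        raise ValueError(f"unknown task code {task!r} in {filename!r}")
    return SessionName(int(participant), task, int(text))
```

For example, parse_session_name("P01_P2.xml") yields participant 1, task "P" and text 2, while a name with an unlisted task code raises a ValueError.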
Adding Language tag
Before uploading a Translog-II log file through the TPR-DB management tool, you must insert a language tag, so that the server knows the source and target languages and can process the files accordingly. Open the xml file and insert a line <Languages … /> with the source and target languages of the session, at the position indicated below. IMPORTANT: Make sure to use the quotes exactly as shown in the example, and no additional spaces between attributes and values!
...
<fullScreen>false</fullScreen>
<lockWindows>false</lockWindows>
<Languages source="en" target="es" task="translating" />
<Plugins>
<Key_Logger />
...
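For a larger number of log files, the tag can be inserted with a script instead of by hand. The sketch below is an illustration only: the function insert_language_tag is a hypothetical helper, it works on the file content as a string, and it assumes that the log contains a <Plugins> element as in the excerpt above. The default task value "translating" is taken from that excerpt; which other values the server accepts is not covered here.

```python
def insert_language_tag(xml_text: str, source: str, target: str,
                        task: str = "translating") -> str:
    """Return xml_text with a <Languages ... /> line placed before <Plugins>."""
    if "<Languages " in xml_text:
        # A tag is already present; leave the file untouched.
        return xml_text

    marker = "<Plugins>"
    index = xml_text.find(marker)
    if index == -1:
        raise ValueError("no <Plugins> element found in the log file")

    # Straight double quotes, no spaces around '=' (see the note above).
    tag = f'<Languages source="{source}" target="{target}" task="{task}" />'

    # Reuse the indentation of the <Plugins> line if it stands on its own line.
    line_start = xml_text.rfind("\n", 0, index) + 1
    indent = xml_text[line_start:index]
    if indent.strip():
        indent = ""

    return xml_text[:index] + tag + "\n" + indent + xml_text[index:]
```

Reading and writing the file is left to the caller, who should use the encoding declared in the log file. Calling the function twice on the same content does not add a second tag.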
Upload and process Studies
You need an account to generate a TPR-DB from the management tool with your own study. Contact m.gummiball[at]gmail.com to obtain an account that allows you to upload your studies and to change the word alignments. You can also generate a TPR-DB from logging data using Perl scripts.
Log in to the TPR-DB management tool
Open the Mozilla Firefox browser (avoid Chrome or Internet Explorer)
Log in to YAWAT: https://critt.as.kent.edu/cgi-bin/yawat/yawat.cgi
Provide login name and password
Change the address in the address bar manually to: https://critt.as.kent.edu/cgi-bin/yawat/tpd.cgi
You can upload and download studies and generate/download TPR-DB tables there
Upload a new study
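Make sure that your files follow the naming conventions and contain the language tag described above.
Zip your Translog-II xml log files, together with the alignment files (*src, *tgt, *atag) if available. Use zip only; do not use any other compression tool.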
On the management tool, click on the "Browse" button and locate the zip file.
Provide a study name (e.g. bella in the screenshot below) which should consist of letters and/or digits. Do not use a space in the name.
Press the upload button. This process will
extract *xml files and store them in a Translog-II folder
tokenize the source and the target texts
segment the texts into sentences
align the segments (sentence by sentence) in the log files, if alignment files are not provided
upload the aligned sentences to the YAWAT tool
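The packaging step before the upload can also be scripted. The following sketch uses Python's standard zipfile module and is based on assumptions: the helper names are hypothetical, the archive is written flat (file names only, no sub-folders), and "letters and/or digits" is read as ASCII letters and digits. Whether the server expects exactly this archive layout is not stated on this page.

```python
import re
import zipfile
from pathlib import Path

# Log files plus the optional alignment files mentioned above.
UPLOAD_SUFFIXES = (".xml", ".src", ".tgt", ".atag")


def is_valid_study_name(name: str) -> bool:
    """True if the name consists only of ASCII letters and/or digits."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None


def zip_study(folder, zip_path):
    """Zip all log and alignment files found directly in folder.

    Returns the list of file names that were added to the archive.
    """
    folder = Path(folder)
    wanted = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in UPLOAD_SUFFIXES
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in wanted:
            # Store only the file name, not the local directory structure.
            archive.write(path, arcname=path.name)
    return [p.name for p in wanted]
```

A call such as zip_study("my_sessions", "bella.zip") would collect the files of a local folder (here the hypothetical "my_sessions") into one archive that can then be selected with the "Browse" button.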
Add data to available collections
When collecting data to add to one of the available ST data collections (these are currently "multiLing", "missionStatement" and "ministerSpeech"), the source text files must be identical with respect to text numbering, segmentation and tokenization. Perhaps the best way to ensure this is to use the available Translog-II *project files, which can be downloaded from this link. Once the Translog-II data has been collected and uploaded to the TPR-DB as described above, the *src files (in the Alignment folder) should be replaced by the corresponding *src files of previous studies, which can also be downloaded from this same link. For ST-TT alignment and all further automatic processing, make sure that the src files in your new study are identical to the src files of the studies already available in that collection.
Uploading Trados data to the TPR-DB
Keylogging data collected with Trados Studio (i.e. with the Qualitivity plugin) can be uploaded to the CRITT TPR-DB. The upload option can also synchronize the keylogging data with data from various eye-trackers (Tobii, Eyelink and Gazepoint) recorded during the translation sessions. More details are provided here, and video instructions are available on YouTube.
Uploading PET data to the TPR-DB
PET is a tool to facilitate the post-editing of translations from MT systems. PET produces *per files which contain the keylogging data collected in PET sessions. These *per files can be uploaded to the TPR-DB.
Zip all PET *per files that are part of the same session. The file root of the zipped *per files should follow the TPR-DB naming conventions. The zip file can have any name.
Give a study name, preferably using only alphanumeric characters. The name in the picture below is "GRO8".
Under "Task Name", tick PET to tell the tool that the upload contains PET files.
The "SI" in "Other Task" produces the study "GRO_SI" with automatic word alignment.
The PET-to-Translog conversion produces a separate Translog-II session for each PET unit (i.e. the PET term for segment). That is, if the PET file name was P33_T1.per and the file has 21 units (segments), there will be 21 Translog sessions after the conversion: P33_T1001 ... P33_T1021. The PET-to-Translog conversion assumes that all *per files that end in the same number (e.g., 1 in P33_T1) have an identical number of units and identical ST content. TPR-DB computes word translation entropy on this basis.
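The session names that result from the conversion can be listed in advance, for instance to check that later tables contain all expected sessions. This is only a sketch: pet_session_names is a hypothetical helper, and the three-digit, zero-padded unit number is inferred from the example P33_T1001 ... P33_T1021 rather than from a specification.

```python
from pathlib import Path


def pet_session_names(per_filename: str, n_units: int) -> list:
    """List the Translog session names expected for one PET *per file.

    Assumes each unit gets the file root plus a 3-digit unit number,
    as in the P33_T1001 ... P33_T1021 example.
    """
    if n_units < 1:
        raise ValueError("a PET file must contain at least one unit")
    root = Path(per_filename).stem  # "P33_T1.per" -> "P33_T1"
    return [f"{root}{unit:03d}" for unit in range(1, n_units + 1)]
```

For the file from the example, pet_session_names("P33_T1.per", 21) starts with "P33_T1001" and ends with "P33_T1021".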
Speech data and the CRITT TPRDB
The speech signal should be transcribed in such a way that each word comes with a time stamp indicating its production time. You can use an automatic speech recognition (ASR) system for the transcription, such as IBM Watson (see tutorial here and here) or Speechmatics (see documentation and some scripts), which in our experience produces better output and punctuation marks. The automatic transcription is then revised (e.g. with ELAN or with a spreadsheet), converted into a Translog-II compatible xml file and uploaded to the CRITT TPR-DB.
Depending on the type of spoken data (interpretation, sight translation, reading aloud, etc.) the voice data can also be synchronized with the gaze data (e.g. during reading or sight translation) and/or with the audio input (e.g. for interpreting, simultaneous interpreting with text).
Generate a TPR-DB with Perl scripts
You can also generate TPR-DB tables locally on your computer:
Preparing Windows
Install Jdtag (an offline manual word alignment tool): https://drive.google.com/file/d/1BaBuN_hOdp6C1SeVWyoU6CFvBqhtjVGg/view?usp=sharing
or alternatively use YAWAT and the management tool for online word alignment
or use an automatic word alignment tool
Install Perl (needed for running the analysis scripts): http://strawberryperl.com/ or (possibly better) install Cygwin with the perl package
Download and unzip the latest TPR-DB version from here.
Make a "tprdb" directory and place in it at least the bin folder with the Perl scripts (i.e. tprdb/bin/)
Create a "<study>/Translog-II" folder inside the tprdb directory, i.e.: tprdb/<study>/Translog-II
Place all the Translog-II xml log files in the Translog-II folder, named according to the conventions given above, and insert the language tag.
Tokenize and align texts
In Cygwin: cd into your tprdb/bin folder and run:
./StudyAnalysis.pl -C tokenize -S <study> -U <user>
This will produce three files inside the folder tprdb/<study>/Alignment with suffixes *.src, *.tgt and *.atag (a special tokenizer must be installed for Japanese and Chinese)
Open the Jdtag program and load the *.src, *.tgt and *.atag files. Manually align the words in the source and target texts and save the result as <study>/Alignment/*.atag
Or use one of the other options for word alignment
Generate tables
In Cygwin: cd into your tprdb/bin folder and run:
./StudyAnalysis.pl -C tables -S <study> -U <user>
This will produce two folders <user>/<study>/Events and <study>/Tables. The folder <study>/Tables contains a number of files which are helpful for further data analysis.
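When several studies have to be processed, the two calls can be wrapped in a small driver script. The sketch below is an assumption-laden illustration: the helper names are hypothetical, only the two -C values shown on this page (tokenize and tables) are accepted, and run_step assumes that StudyAnalysis.pl is executable from within the bin folder of a Cygwin-like environment.

```python
import subprocess

# Only the two processing steps documented on this page.
KNOWN_STEPS = ("tokenize", "tables")


def study_analysis_command(step: str, study: str, user: str) -> list:
    """Build the argument list for one StudyAnalysis.pl call."""
    if step not in KNOWN_STEPS:
        raise ValueError(f"step must be one of {KNOWN_STEPS}, got {step!r}")
    return ["./StudyAnalysis.pl", "-C", step, "-S", study, "-U", user]


def run_step(step: str, study: str, user: str, bin_dir: str = "tprdb/bin") -> None:
    """Run one step from inside the bin folder; raises if the script fails.

    Assumption: the script is executable and resolves relative to bin_dir.
    """
    subprocess.run(study_analysis_command(step, study, user),
                   cwd=bin_dir, check=True)
```

Manual word alignment (or one of the other alignment options) still has to take place between the tokenize step and the tables step, so a driver should not run both steps back to back for a study whose alignment has not yet been completed.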