TPR-DB management tool

The TPR-DB management tool facilitates the analysis of the logging data through a browser interface. It allows you to:

Upload and process your studies in the TPR-DB

You need an account to generate a TPR-DB from your own study with the TPR-DB management tool. Contact mc.ibc[at]cbs.dk to obtain an account that allows you to upload your studies and to change the word alignments. You can also generate a TPR-DB from logging data using Perl scripts.

Log in to the Yawat management tool:

  • You can upload and download studies and generate TPR-DB tables

  • You can download alignment and table information that is contained in the most recent TPR-DB

Upload a new study to the TPR-DB:

  • Place your Translog-II xml logfiles in a zipped folder; the folder can also contain alignment files (*src, *tgt, *atag), if available. Take care that:

    • The Translog-II log files follow the naming convention (e.g. P01_T05.xml, see below)

    • The xml files contain correct source-target language tags (see below)

  • Provide a study name which should consist of upper-case letters followed by a few digits

  • Press the upload button:

    • The upload function will extract *xml files and store them in a Translog-II folder
    • It will tokenize and sentence-align the texts in the log files, if alignment files are not provided

    • It will generate a Yawat representation that can be manually checked

  • After successful uploading you should be able to see something like this

Edit word alignment with Yawat:

  • Press “Open Yawat” to edit word alignment with Yawat. Once you are done with the alignment, you can go back to the management tool

  • Press “Save Yawat Alignments” to save your alignment modifications on the server

  • After saving the alignments, the buttons should turn black and you can download your new alignments

Make TPR-DB tables with the new word alignment from Yawat:

  • Press “Make Tables” to (re-) generate TPR-DB tables with the new alignment information

  • The alignment information and tables can be downloaded with the download buttons

  • After you have downloaded the Alignment and the Table folders you can click on “Delete Study” so as to remove the traces on the server. If you want to add a study permanently to the publicly available TPR-DB you are invited to send a mail.

Prepare session log files for uploading to the TPR-DB

Naming conventions for Translog files in the TPR-DB:

The TPR-DB is an anonymized repository of logged translation sessions. The logged user activity data (UAD) of a session is contained in a single file, whose name contains the main random variables of the translation study (Participant, Task, Text) in the following format:

    Participant_TaskText.xml


Participants are numbered P01 to Pn, texts are numbered 1 to m, and task codes are as follows:

  • T for from scratch Translation

  • P for Post-editing of machine translation

  • E for monolingual Editing (Post-editing without Source Text)

  • C for text Copying

  • R for Revision

  • S for Spoken translation production (e.g. transcribed sight translation)

  • I for transcribed Interpreting

  • A for Authoring (text production)

  • L for reading

A session log file name is a concatenation of this information, as the following examples show:

  • P01_T1.xml: participant P01 Translated text 1

  • P01_P2.xml: participant P01 Post-edited text 2

  • P03_C3.xml: participant P03 Copied text 3

All log files can be placed in one folder, which can then be zipped for uploading through the TPR-DB management tool.
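Before zipping, it can be useful to check that every file name follows the convention. The following is a minimal Python sketch, not part of the TPR-DB tools: the function name parse_session_name is hypothetical, and the pattern assumes that a participant is written as "P" plus digits and that the text number consists of digits only.

```python
import re

# Task codes as listed in the naming convention above.
TASK_CODES = {
    "T": "Translation",
    "P": "Post-editing",
    "E": "Editing",
    "C": "Copying",
    "R": "Revision",
    "S": "Spoken translation",
    "I": "Interpreting",
    "A": "Authoring",
    "L": "Reading",
}

# Assumption: participant = "P" + digits, text number = one or more digits.
SESSION_NAME = re.compile(r"^(P\d+)_([A-Z])(\d+)\.xml$")


def parse_session_name(filename):
    """Return (participant, task_code, text_number) or raise ValueError."""
    match = SESSION_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a valid session file name: {filename!r}")
    participant, task, text = match.groups()
    if task not in TASK_CODES:
        raise ValueError(f"unknown task code {task!r} in {filename!r}")
    return participant, task, int(text)


if __name__ == "__main__":
    for name in ["P01_T1.xml", "P01_P2.xml", "P03_C3.xml"]:
        print(name, parse_session_name(name))
```

File names that do not match, or that use a task letter outside the list above, raise a ValueError and can be renamed before the upload.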

Insert language tag in Translog-II file:

Before uploading a Translog-II log file through the TPR-DB management tool (see above) you must insert a Languages tag. This tells the server which are the source and the target languages, so that it can process the files accordingly:

Open the xml file and insert a line <Languages … /> with the source and target languages of the session at the position indicated below. IMPORTANT: take care to use straight double quotes, as in the example, and no additional spaces between attributes and values!

...
<fullScreen>false</fullScreen>
<lockWindows>false</lockWindows>
<Languages source="en" target="es" task="translating" />
<Plugins>
<Key_Logger />
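For a study with many sessions the tag can also be inserted with a script. The following Python sketch is only an illustration under stated assumptions: the function name insert_languages_tag is hypothetical, it works on the raw text of a log file (reading and writing the file, and choosing its encoding, is left to the caller), and it assumes that the <Plugins> element starts on a line of its own, as in the snippet above.

```python
def insert_languages_tag(xml_text, source, target, task="translating"):
    """Insert a <Languages .../> line directly before the <Plugins> line.

    The text is edited line by line so that the rest of the log file
    is left exactly as it was.
    """
    if "<Languages " in xml_text:
        return xml_text  # already tagged, nothing to do
    tag = f'<Languages source="{source}" target="{target}" task="{task}" />'
    lines = xml_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.lstrip().startswith("<Plugins"):
            # Reuse the indentation and line ending of the <Plugins> line.
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\r\n" if line.endswith("\r\n") else "\n"
            lines.insert(i, indent + tag + newline)
            return "".join(lines)
    raise ValueError("no <Plugins> element found; cannot place the tag")
```

The tag is written with straight double quotes and without spaces around the equals signs, matching the example above.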



Edit sentence segmentation and sentence alignment

In order to do word alignments within YAWAT, the translations are split up into an equal number of paragraphs on the source and the target sides. Each paragraph may consist of one or more segments (sentences), but the number of segments in one paragraph should be minimal so as to represent shortest possible alignment correspondences. It is possible to represent within YAWAT m-to-n segment alignments and within the aligned paragraphs any conceivable m-to-n word alignment is permitted, even across segment boundaries. However, it is not possible to align words in YAWAT across paragraph boundaries.

When automatically processing the segment alignment relations, the TPR-DB management tool assumes that sentences are translated in a one-to-one fashion and automatically generates sentence-based paragraph alignments. However, in some cases this does not reflect the situation in the translated texts, or the automatic sentence segmentation does not work properly: segments might be split at wrong positions, or segments might have no 1-to-1 correspondence. This section discusses how to adjust this.

Examples of wrong segment alignment

Non-matching source-target alignments may be due to wrong sentence segmentation and/or to the fact that sentences are not translated one-by-one. The Figure below (left) shows an example where the English source text is wrongly split after a dot (2 .). Here the wrongly inserted segment boundary should be deleted and the two segments 8 and 9 should be joined into one. The right figure is an example of a translation where two English source sentences (10 and 11) on the left are translated as a single sentence (10) into Chinese. In this case we would like to keep the segment boundary but tell YAWAT to join the two English segments into one, so they can be word-aligned.

A symmetric situation is of course also conceivable, where the target text is wrongly split into non-segments or where a translator produces two sentences for one source sentence. In any case, whereas the segmentation should be based on monolingual considerations (i.e. sentence boundaries), the alignment of the segments is a cross-lingual consideration and need not be one-to-one. In order to remedy segment alignment, the alignment information has to be manually adjusted.

Representation and manual correction of alignment information

Alignment information is contained in three files: the source tokens (*.src), the target tokens (*.tgt) and the alignment relations between source and target tokens (*.atag). Wrong tokenization, sentence segmentation, and word and sentence alignment can be manually rectified in these files.

<W cur="447" id="82" segId="7">;</W>
<W cur="450" id="83" segId="8" space="&#xA;">and</W>
<W cur="454" id="84" segId="8" space=" ">2</W>
<W cur="455" id="85" segId="8">.</W>
<W cur="457" id="86" segId="9" space=" ">Load</W>
<W cur="462" id="87" segId="9" space=" ">sharp</W>

Example 1: token information in the *.src and *.tgt files indicating the word id and the segment (segId).

Wrong sentence segmentation can be corrected by manually modifying the value of segId in the target (*tgt) files (or the *src files), for instance by replacing segId="9" with segId="8" in Example 1. However, be aware that changing the *src files might lead to incompatibility when comparing different translations: modifying the word id or the segId of the source files may cause problems during further processing.
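The replacement of segId values can be sketched in Python as follows. This is an illustration only: the root element name "Text" is a placeholder, since the enclosing element of the token files is not shown here, and the function name merge_segment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the <W> tokens sit under a single root element; "Text" is a
# placeholder name used only for this sketch.
SAMPLE = """<Text>
<W cur="450" id="83" segId="8" space="&#xA;">and</W>
<W cur="454" id="84" segId="8" space=" ">2</W>
<W cur="455" id="85" segId="8">.</W>
<W cur="457" id="86" segId="9" space=" ">Load</W>
<W cur="462" id="87" segId="9" space=" ">sharp</W>
</Text>"""


def merge_segment(root, old_seg, new_seg):
    """Reassign every token of segment old_seg to segment new_seg."""
    changed = 0
    for token in root.iter("W"):
        if token.get("segId") == str(old_seg):
            token.set("segId", str(new_seg))
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
changed = merge_segment(root, old_seg=9, new_seg=8)
seg_ids = [w.get("segId") for w in root.iter("W")]
```

The sketch only rewrites the segId attribute of the affected tokens. It does not renumber later segments and does not touch the salign entries in the *atag file, which have to be kept consistent separately.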

Information concerning the segment alignment of source and target texts is contained in the *atag files through the salign elements. Any m-to-n sentence alignment is possible. In Example 2 the source sentences 10 and 11 are aligned to target sentence 10, the successive src segment 12 is aligned with tgt segment 11, etc.

<alignFile href="P02_T4.src" key="a" sign="_input"/>
<alignFile href="P02_T4.tgt" key="b" sign="_input"/>
<salign src="9" tgt="9" />
<salign src="10" tgt="10" />
<salign src="11" tgt="10" />

<salign src="12" tgt="11" />

Example 2: alignment information between the "P22_P3.src" and the "P22_P3.tgt" files

The figure shows the effect of this alignment in YAWAT. Assigning src="0" or tgt="0" indicates that the segment is not aligned.
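To see which segments end up in the same alignment paragraph, the salign pairs can be grouped into connected blocks. The Python sketch below is an assumption-laden illustration, not the code used by the server: the root element name "Alignment" is a placeholder, the function name alignment_groups is hypothetical, and a value of "0" is treated as "not aligned", as stated above.

```python
import xml.etree.ElementTree as ET

# Assumption: "Alignment" is a placeholder root element for this sketch.
SAMPLE = """<Alignment>
<salign src="9" tgt="9" />
<salign src="10" tgt="10" />
<salign src="11" tgt="10" />
<salign src="12" tgt="11" />
</Alignment>"""


def alignment_groups(root):
    """Group salign pairs into m-to-n blocks of connected segments."""
    groups = []  # list of (set of src ids, set of tgt ids)
    for element in root.iter("salign"):
        src, tgt = element.get("src"), element.get("tgt")
        # "0" marks an unaligned side, so it never links two groups.
        srcs = {src} if src != "0" else set()
        tgts = {tgt} if tgt != "0" else set()
        merged_src, merged_tgt = set(srcs), set(tgts)
        rest = []
        for g_src, g_tgt in groups:
            if g_src & srcs or g_tgt & tgts:
                merged_src |= g_src
                merged_tgt |= g_tgt
            else:
                rest.append((g_src, g_tgt))
        rest.append((merged_src, merged_tgt))
        groups = rest
    return [(sorted(s, key=int), sorted(t, key=int)) for s, t in groups]


groups = alignment_groups(ET.fromstring(SAMPLE))
```

For the sample, source segments 10 and 11 fall into one group with target segment 10, while segments 9 and 12 each form a one-to-one group.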

The manually amended alignment files can be uploaded again through the TPR-DB management tool in two different ways:

  • As described above (Upload a new study to the TPR-DB): place all alignment files for the study in a zip folder together with the Translog-II log files. Type in a new study name; the old study can then be deleted

  • Upload only the modified *atag and/or *tgt/*src files to overwrite the files of the same name in an existing study (a backup is created on the server)

As the TPR-DB management tool assumes that source texts within one study with the same text identification number are identical, sentence segmentation should only be modified on the target language side, so that the source files (*src) remain identical for all sessions. On the target side, modifications can be made with respect to word tokenization, sentence segmentation, and segment alignment.

Using Notepad++ for segmentation and segment alignment

Instead of editing the alignment files (*tgt, *src and *atag) manually, it is also possible to convert the information automatically into a textual format, edit the segments in an editor (e.g. Notepad++) and convert it back into the *src, *tgt and *atag files. The following Perl script reads the three alignment files and produces source and target text files. The same script also performs the reverse operation. It can be called, for instance, as follows:

./Atag2Sentences.pl -A P22_P3

The command will generate two files, P22_P3.SourceTok and P22_P3.TargetTok. These files can be loaded into an editor (for instance Notepad++) and visualized in two views (in Notepad++ under View → Move/Clone Current Document → Move to Other View). The figure below shows an example of the paragraph alignment view.


Each line represents a paragraph and successive lines in the two editors represent successive paragraph alignments. The number of lines in the source and the target editor must thus be identical.
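The requirement that both files have the same number of lines can be checked after editing with a few lines of Python. This is a minimal sketch: the function names are hypothetical and UTF-8 is assumed as the file encoding, which may need to be changed for your files.

```python
from pathlib import Path


def paragraph_counts(basename, encoding="utf-8"):
    """Return the number of lines in <basename>.SourceTok and .TargetTok."""
    src = Path(f"{basename}.SourceTok").read_text(encoding=encoding)
    tgt = Path(f"{basename}.TargetTok").read_text(encoding=encoding)
    return len(src.splitlines()), len(tgt.splitlines())


def counts_match(basename, encoding="utf-8"):
    """True if source and target contain the same number of paragraphs."""
    n_src, n_tgt = paragraph_counts(basename, encoding)
    return n_src == n_tgt
```

For example, counts_match("P22_P3") reads P22_P3.SourceTok and P22_P3.TargetTok from the current folder and reports whether their line counts agree.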

Wrong sentence segmentation can be rectified by inserting or deleting line breaks. For instance, the three wrong segments above

    7.     ... door seal ;

    8.     and 2 .

    9.     Load sharp knives

can be joined in a single segment by joining the lines:


    7.     ... door seal ;

    8.    and 2 .  Load sharp knives

Segments can also be split up. For instance, in this example, the headline in the beginning of the translation was not automatically detected as an independent segment and was merged with the second segment.

  1. 生活 成本 上升 使 家庭 生活 遭受 冲击 由于 食品 燃料 价格 17 速度 飙升 英国 家庭 每年 必须 支出 31,300 英镑

A line break can be inserted to split the paragraph into two:

  1. 生活 成本 上升 使 家庭 生活 遭受 冲击

  2. 由于 食品 燃料 价格 17 速度 飙升 英国 家庭 每年 必须 支出 31,300 英镑


Several segments can be joined into one alignment paragraph by inserting a triple slash ‘///’, as for instance in line 10 in the English source window (see Figure).

When inserting a triple slash ‘///’, make sure that it is preceded and followed by white space; otherwise it will be attached to the preceding or following token and recognized as a non-permitted modification of that token.
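A triple slash that has accidentally been glued to a token is easy to overlook. The following Python sketch lists the lines where this happens; the function name find_attached_joins is hypothetical and the check is only an aid for proofreading the edited text files.

```python
import re

# A "///" that touches a non-space character on either side.
BAD_JOIN = re.compile(r"\S///|///\S")


def find_attached_joins(text):
    """Return the 1-based numbers of lines where '///' is glued to a token."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if BAD_JOIN.search(line)
    ]
```

An empty result means that every triple slash in the text is separated from its neighbouring tokens by white space.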

By typing the following command, the new segmentation and alignment information can be back-converted to the *src, *tgt and *atag files:

./Atag2Sentences.pl -A P22_P3 -S P22_P3

The parameter -A P22_P3 expects the base name of the *src, *tgt and *atag files; the parameter -S P22_P3 expects the base name of the two text files, P22_P3.SourceTok and P22_P3.TargetTok. The script overwrites the three *src, *tgt and *atag files, but an optional parameter -O <basename> may be used to redirect the output to the three files <basename>.{src,tgt,atag}. As a result of the operations discussed above, the P22_P3.atag should now contain the following segment alignment information:


The editor can also be used to adjust wrong word tokenization. Since the TPR-DB tokenizer has very little information, abbreviations like “etc.” or, as above, “2.” might be split into two tokens (“etc” and “.”). The dot will then be recognized as a full stop, introducing an erroneous segment boundary. This can be adjusted by 1) deleting the blank space between the two tokens, which joins them into one token “etc.”, or 2) only deleting the line break as described above. Method 1) should be applied with caution to *tgt files and not to *src files if compatibility and comparability between different translations of the same source is expected: if tokenization is modified, the back-conversion function will assign new id numbers to the tokens, which might be incompatible with other versions of the same source file. The editor should not be used to post-edit translations or to change the sequence of characters. Only the addition or deletion of white-space characters (i.e. blank space, line break, tab) is permitted if the regenerated tokenization and alignment files are to remain compatible with the translation process data (i.e. the recorded keystrokes).
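Whether an edit has changed only white space can be checked before the back-conversion. The Python sketch below compares two versions of a text file while ignoring all white space; it also ignores standalone ‘///’ markers, on the assumption that these are the only non-white-space additions the conversion accepts. The function name only_whitespace_changed is hypothetical.

```python
def only_whitespace_changed(original, edited):
    """True if both texts are identical once white space and standalone
    '///' join markers are ignored."""

    def normalize(text):
        # split() without arguments splits on blanks, tabs and line breaks.
        tokens = [token for token in text.split() if token != "///"]
        return "".join(tokens)

    return normalize(original) == normalize(edited)
```

A result of False indicates that characters were added, removed or reordered, so the edited file should be compared with the original before it is converted back.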

The changed files can then be zipped and uploaded to the TPR-DB server. The uploaded alignment files will overwrite the existing ones, as explained above.
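The zip archive can be created with any archiving tool. As one possibility, the following Python sketch collects the alignment files of a folder into an archive. The function name zip_alignment_files is hypothetical, and the sketch assumes that the files may sit at the top level of the archive; whether the server expects a particular folder layout inside the zip is not specified here.

```python
import zipfile
from pathlib import Path


def zip_alignment_files(folder, archive_path, suffixes=(".src", ".tgt", ".atag")):
    """Write all files with the given suffixes from folder into a zip archive.

    Returns the names of the files that were added.
    """
    folder = Path(folder)
    files = sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix in suffixes
    )
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store only the file name, not the full path on disk.
            archive.write(path, arcname=path.name)
    return [path.name for path in files]
```

Passing suffixes=(".xml", ".src", ".tgt", ".atag") would also include the Translog-II log files, as needed when a complete study is uploaded.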

Generate a TPR-DB with perl scripts

Preparing your Windows machine:

  1. Install Jdtag (offline word alignment tool): https://dl.dropbox.com/u/7757461/bin/jdtag-0.0.7.jar
    or alternatively use YAWAT and the management tool for online word alignment 

  2. Install Perl (needed for running the analysis scripts): http://strawberryperl.com/  or (better?) install cygwin with the perl package

  3. Download and unzip the latest TPR-DB version  https://sites.google.com/site/centretranslationinnovation/tpr-db 
    (create a "tprdb" directory and place at least the bin folder in it)

  4. Create a "<study>/Translog-II" folder inside the tprdb directory, i.e.: tprdb/<study>/Translog-II. Place all the Translog-II xml log files in the Translog-II folder with naming conventions given above and insert the language tag. 
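The folder layout from step 4 can also be created with a short script. The Python sketch below is an illustration only: the function name prepare_study and its arguments are hypothetical, while the folder names follow the layout tprdb/<study>/Translog-II described above.

```python
import shutil
from pathlib import Path


def prepare_study(tprdb_dir, study, log_dir):
    """Create <tprdb_dir>/<study>/Translog-II and copy the xml logs into it.

    Returns the target folder and the names of the copied files.
    """
    target = Path(tprdb_dir) / study / "Translog-II"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for log in sorted(Path(log_dir).glob("*.xml")):
        shutil.copy2(log, target / log.name)
        copied.append(log.name)
    return target, copied
```

The sketch copies the files unchanged; the naming convention and the Languages tag still have to be checked in the log files themselves.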

Tokenize and align texts

In Cygwin: cd into your tprdb/bin folder and run:

./StudyAnalysis.pl -C tokenize -D <study>

This will produce three files inside the folder <Study>/Alignment with the suffixes *.src, *.tgt and *.atag (a special tokenizer must be installed for Japanese and Chinese).

Open the Jdtag program and load the *.src, *.tgt and *.atag files. Manually align the words in the source and target texts and save the alignment in the *.atag file in <Study>/Alignment.

Generate table files

In Cygwin: cd into your tprdb/bin folder and run:

./StudyAnalysis.pl -C tables -S <Study>

This will produce two folders, <Study>/Events and <Study>/Tables. The folder <Study>/Tables contains a number of files which are helpful for further data analysis.