Alignment

In order to generate meaningful TPR-DB tables, the source and target texts need to be properly aligned at both the segment and the word level. The TPR-DB management tool automatically generates sentence-based segment alignments. However, you may need to adjust them manually, because YAWAT cannot align words across segment boundaries. The most efficient workflow is therefore 1) to inspect the automatically produced representation in YAWAT to check the sentence alignment, 2) to adjust the sentence alignment manually if necessary, and 3) to start the word alignment.

The sections below give these instructions in detail.

You can also download these instructions as a PDF file from here

Post your technical, methodological, and theoretical questions and comments here

Inspect (public) studies in the TPR-DB

You can inspect alignments of public studies in the TPR-DB on YAWAT. Please use Firefox and avoid Internet Explorer or Chrome.

Click on any of the listed studies and then on any session. You can hover over the words and see the word alignments. You can download the TPR-DB as described here.

Reference:

Ulrich Germann. 2008. Yawat: Yet Another Word Alignment Tool. Proceedings of the ACL-08: HLT Demo Session (Companion Volume), pages 20–23. Association for Computational Linguistics. http://www.aclweb.org/anthology/P08-4006

Another visualization of translation process data with R is shown on this page

How to access the YAWAT tool for alignment

Word alignment can be done manually with the YAWAT tool. YAWAT can be accessed from the TPR-DB management tool (see "How to log in to the TPR-DB management tool"). 

Make TPR-DB tables with new word alignment with YAWAT:

You can create new TPR-DB tables with the new alignment information once you are done with all the alignments in one study. To do this, on the TPR-DB management tool:

To get a personal YAWAT account, please send an email to m.gummiball[at]gmail.com.

Word alignment with YAWAT

How to do word-alignment in YAWAT

When you align the ST words and TT words, try to align every single word, including all punctuation marks, at the smallest possible level. For example, the phrase [Killer nurse] in the ST and [enfermero asesino] in the Spanish TT should be aligned as [Killer - asesino] and [nurse - enfermero], NOT as [Killer nurse - enfermero asesino]. Follow the steps described below:



Sentence segmentation and alignment

In order to do word alignments in YAWAT, the source and the target text must be split into an equal number of segments, such that each source segment and its target segment are translations of each other. Each segment may consist of one or more sentences (sub-segments). YAWAT can represent m-to-n sentence alignments, and within an aligned segment any conceivable m-to-n word alignment is permitted. However, it is not possible to align words in YAWAT across segment boundaries.

The TPR-DB management tool automatically generates sentence-based segment alignments, assuming that texts are translated sentence by sentence. However, the automatic sentence segmentation sometimes does not work properly (for instance when several ST sentences are translated into one TT sentence, or vice versa) and/or in some cases sentences are detected and split up where there is actually no sentence boundary.

Examples of wrong segment alignment

Figure 1 below shows an example where the English source text is wrongly split after a dot ("2 ."). Here the wrongly inserted segment boundary should be deleted and the two segments 8 and 9 should be joined into one sentence, "and 2. Load sharp knives ...". Figure 2 is an example of a translation where two English source sentences (Segments 10 and 11) are translated as a single sentence (10) in Chinese; as a result, the translation of Segment 12 is incorrectly displayed next to Segment 11. In this case we would like to keep the segment boundary but tell YAWAT to join the two English segments (10 and 11) into one, so that they can be word-aligned in a single box in YAWAT.

Figure 1. Incorrect ST segmentation
Figure 2. Two ST sentences translated into one

Representation and manual correction of sentence alignment

Alignment information is contained in three files: the source text files (*.src) contain the ST tokens, the target text files (*.tgt) contain the TT tokens, and the alignment files (*.atag) contain information about the relations between the two. Information in all three files can be modified to correct wrong tokenization and segment/sentence alignments. You can see and modify the word id (id) and the segment id (segId) in the *.src and *.tgt files, as in Example 1.


    <W cur="447" id="82" segId="7">;</W>

    <W cur="450" id="83" segId="8" space="&#xA;">and</W>

    <W cur="454" id="84" segId="8" space=" ">2</W>

    <W cur="455" id="85" segId="8">.</W>

    <W cur="457" id="86" segId="9" space=" ">Load</W>

    <W cur="462" id="87" segId="9" space=" ">sharp</W>

Example 1: Token information in the *.src and *.tgt files 

Wrong sentence segmentation can be corrected by manually modifying the value of the segId in the *.tgt files (i.e. replacing all segId="9" by segId="8" in Example 1, and successively decreasing all the following segIds). However, do not change the *.src files unless you are producing a brand new study, as the tokenization needs to be identical across all sessions.
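The renumbering described above can also be scripted. The following is a minimal Python sketch, not part of the TPR-DB tools. It assumes the <W> elements sit inside some root element; the <Text> wrapper used here and the last token (id="95") are hypothetical and only added so that the snippet is self-contained and shows the shift of later segIds.

```python
import xml.etree.ElementTree as ET

# The <W> lines are taken from Example 1. The <Text> root and the final
# token (id="95") are hypothetical, added only for illustration.
SAMPLE = """<Text>
<W cur="447" id="82" segId="7">;</W>
<W cur="450" id="83" segId="8" space="&#xA;">and</W>
<W cur="454" id="84" segId="8" space=" ">2</W>
<W cur="455" id="85" segId="8">.</W>
<W cur="457" id="86" segId="9" space=" ">Load</W>
<W cur="462" id="87" segId="9" space=" ">sharp</W>
<W cur="500" id="95" segId="10" space=" ">Next</W>
</Text>"""


def merge_into_previous(root, seg_id):
    """Join segment seg_id into seg_id - 1 and shift all later segIds down by one."""
    for w in root.iter("W"):
        current = int(w.get("segId"))
        if current >= seg_id:
            w.set("segId", str(current - 1))
    return root


root = ET.fromstring(SAMPLE)
merge_into_previous(root, 9)
seg_ids = [w.get("segId") for w in root.iter("W")]
print(seg_ids)
```

Only the segId attributes are touched; the token ids, cursor positions and token text stay as they were, which is what keeps the file compatible with the recorded process data.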

Information concerning the ST-TT alignment is contained in the *.atag files. The sentence alignment is captured in the "salign" elements. The "src" attribute refers to the segId in the *.src file and the "tgt" attribute refers to the segId in the *.tgt file. For example, if a participant translated two ST sentences (10 and 11 in src) into one TT sentence (10 in tgt), you can adjust the segment alignment as shown in Example 2, where the source sentences 10 and 11 are both aligned to target sentence 10. Subsequently, ST sentence 12 is aligned to TT sentence 11, and so on.


    <alignFile href="P02_T4.src" key="a" sign="_input"/>

    <alignFile href="P02_T4.tgt" key="b" sign="_input"/>

    <salign src="9" tgt="9" />

    <salign src="10" tgt="10" />

    <salign src="11" tgt="10" />

    <salign src="12" tgt="11" />

Example 2: Alignment information in the "P02_T4.atag" file

Figure 3 below shows the effect of this alignment in YAWAT: the segments are now appropriately aligned. Assigning src="0" or tgt="0" indicates that the segment is not aligned.

Figure 3. Appropriately aligned segments
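To check which source segments end up in the same YAWAT box, the salign elements of Example 2 can be grouped by target segment. The following is a minimal Python sketch under stated assumptions: the <root> wrapper is hypothetical (the real root element of the *.atag file is not shown here), and entries with src="0" or tgt="0" are simply skipped as unaligned.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Content of Example 2 inside a hypothetical <root> wrapper.
SAMPLE = """<root>
<alignFile href="P02_T4.src" key="a" sign="_input"/>
<alignFile href="P02_T4.tgt" key="b" sign="_input"/>
<salign src="9" tgt="9" />
<salign src="10" tgt="10" />
<salign src="11" tgt="10" />
<salign src="12" tgt="11" />
</root>"""


def source_segments_by_target(root):
    """Map each target segId to the list of source segIds aligned to it."""
    groups = defaultdict(list)
    for salign in root.iter("salign"):
        src = int(salign.get("src"))
        tgt = int(salign.get("tgt"))
        if src == 0 or tgt == 0:
            continue  # "0" marks an unaligned segment
        groups[tgt].append(src)
    return dict(groups)


groups = source_segments_by_target(ET.fromstring(SAMPLE))
print(groups)
```

In this grouping, target segment 10 collects source segments 10 and 11, which corresponds to the 2-to-1 alignment described above.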

The manually amended alignment files can be uploaded again through the TPR-DB management tool in two different ways:

Using Notepad++ for segmentation and segment alignment

Instead of editing the alignment files *.tgt, *.src and *.atag manually, it is also possible to convert the information into a textual format, edit the segments in an editor (e.g. Notepad++) and convert it back into the *.src, *.tgt and *.atag files. The following Perl script reads the three alignment files and produces source and target text files. The same script also performs the reverse operation:

    ./Atag2Sentences.pl -A P22_P3

The command will generate two files P22_P3.SourceTok and P22_P3.TargetTok. These files can be loaded into an editor (for instance Notepad++) and visualized in two views (in Notepad++ under View → Move/Clone Current Document → Move to other View). Figure 4 below shows an example of paragraph alignment view.

Figure 4. Appearance of Notepad++

Each line in the editor represents a segment; successive lines in the two editors represent successive segment alignments. The number of lines in the source and the target editor should thus be identical.
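Before converting back, it can be useful to verify that both files have the same number of lines. The following is a minimal Python sketch, not part of the TPR-DB tools; the target lines in the demo are made-up placeholders.

```python
def segment_lines(text):
    """Return the segments of a *.SourceTok or *.TargetTok text, one per line."""
    return text.splitlines()


def same_number_of_segments(source_text, target_text):
    """True if source and target contain the same number of lines (segments)."""
    return len(segment_lines(source_text)) == len(segment_lines(target_text))


# Source lines follow the example in this section; target lines are placeholders.
source = "... door seal ;\nand 2 .  Load sharp knives\n"
target = "target segment one\ntarget segment two\n"

print(same_number_of_segments(source, target))
```

For real data, the two texts could be read with pathlib.Path("P22_P3.SourceTok").read_text(encoding="utf-8") and the corresponding call for P22_P3.TargetTok; the UTF-8 encoding is an assumption.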

Wrong sentence segmentation can be rectified by inserting or deleting line breaks. For instance, the three wrong segments above

    7.     ... door seal ;

    8.     and 2 .

    9.     Load sharp knives

can be joined in a single segment by joining the lines:

    7.     ... door seal ;

    8.    and 2 .  Load sharp knives

Segments can also be split up. For instance, in this example, the headline in the beginning of the translation was not automatically detected as an independent segment and was merged with the second segment.

A line break can be inserted to split the paragraph into two:

Several sentences can be joined into one alignment segment by inserting a triple slash ‘///’, as for instance in line 10 in the English source window (see Figure).

When inserting a triple slash ‘///’, make sure that it is preceded and followed by white space. Otherwise it will be attached to the preceding or following token and recognized as a non-permitted modification of that token. Take care not to change any tokens in the file, as that will corrupt the analysis. This is not meant to be a post-editing tool, but only a tool to adjust sentence segmentation and alignment.
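A quick way to catch a triple slash that has been glued to a neighbouring token is to split each line on white space and look for tokens that contain ‘///’ without being exactly ‘///’. The following is a minimal Python sketch, not part of the TPR-DB tools; the two sample lines are made up for illustration.

```python
def misplaced_markers(line):
    """Return tokens that contain '///' but are not the standalone marker '///'."""
    return [tok for tok in line.split() if "///" in tok and tok != "///"]


# Made-up sample lines: the first is correct, the second lacks a space before '///'.
good = "first sentence . /// second sentence ."
bad = "first sentence ./// second sentence ."

print(misplaced_markers(good))
print(misplaced_markers(bad))
```

An empty list means every ‘///’ on the line stands on its own; any returned token should be fixed by inserting the missing white space.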

By typing the following command, the new sentence segmentation and alignment information can be back-converted to the *.src, *.tgt and *.atag files:

    ./Atag2Sentences.pl -A P22_P3 -o P22_P3

The parameter -A P22_P3 is the basename for the *.src, *.tgt and *.atag files; the parameter -o P22_P3 expects the two text files, P22_P3.SourceTok and P22_P3.TargetTok. The script overwrites the three *.src, *.tgt and *.atag files, but an optional parameter -O <basename> may be used to redirect the output to the three files <basename>.{src,tgt,atag}. As a result of the operations discussed above, P22_P3.atag should now contain the following segment alignment information:

Figure 5.  Converted P22_P3.atag file

The editor can also be used to adjust wrong word tokenization. Since the TPR-DB tokenizer has very little information, abbreviations like “etc.” or, as above, "2." might be split into two tokens (“etc” and “.”). The dot will then be recognized as a full stop, introducing an erroneous segment boundary. This can be adjusted by 1) deleting the blank space between the two tokens, joining them into one token “etc.”, or 2) only deleting the line break as described above. Method 1) should be applied with caution to *.tgt files and not at all to *.src files if compatibility and comparability between different translations of the same source is expected. If the tokenization is modified, the back-conversion function will assign new id numbers to the tokens, which might be incompatible with other versions of the same source file. The editor should not be used to post-edit translations or to change the sequence of characters. Only the addition or deletion of white-space characters (i.e. blank space, line break, tab) is permitted if the re-generated tokenization and alignment files are supposed to remain compatible with the translation process data (i.e. recorded keystrokes).
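Whether an edit changed only white space can be checked by comparing the two versions of the text with all white space removed. The following is a minimal Python sketch, not part of the TPR-DB tools. It assumes that standalone ‘///’ markers should be ignored in the comparison, which also means that a genuine token consisting only of ‘///’ would be ignored.

```python
def non_whitespace_chars(text):
    """Concatenate all tokens except standalone '///' markers, dropping white space."""
    return "".join(tok for tok in text.split() if tok != "///")


def only_whitespace_changed(before, after):
    """True if the two texts differ only in white space and '///' markers."""
    return non_whitespace_chars(before) == non_whitespace_chars(after)


before = "... door seal ;\nand 2 .\nLoad sharp knives\n"
after_ok = "... door seal ;\nand 2.  Load sharp knives\n"      # tokens joined, line break removed
after_bad = "... door seal ;\nand two .  Load sharp knives\n"  # a character sequence was changed

print(only_whitespace_changed(before, after_ok))
print(only_whitespace_changed(before, after_bad))
```

The check says nothing about whether a tokenization change is advisable for a given file; it only confirms that no characters other than white space were added or removed.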

The changed files can then be zipped and uploaded to the TPR-DB server. The uploaded alignment files will overwrite the existing ones, as explained above.