Word and Segment Alignment

In order to generate meaningful TPR-DB tables, the (source and target) texts need to be properly aligned on a segment and word level.

- Word alignment with YAWAT
- Optional: Word translation error-annotation
- Segment align

Post your technical, methodological, and theoretical questions and comments here

Word alignment with YAWAT

In order to compute TPR-DB summary tables, the translations must be properly aligned at the word and segment levels. Word-level alignment can be done manually with the YAWAT tool, which can be accessed from the TPR-DB management tool:

- In the management tool, press “Open Yawat” to edit the word alignment with YAWAT. Follow these instructions. Once you are done with the alignment, you can go back to the management tool.
- Press “Save Yawat Alignments” to save your alignment modifications on the server
- After saving the alignments, the buttons should turn black and you can download your new alignments

Make TPR-DB tables with new word alignment with YAWAT:

- Press “Make Tables” to (re-) generate TPR-DB tables with the new alignment information
- The alignment information and tables can be downloaded with the download buttons
- After you have downloaded the Alignment and the Table folders you can click on “Delete Study” so as to remove the traces on the server. If you want to add a study permanently to the publicly available TPR-DB you are invited to send a mail to m.gummiball[at]gmail.com.

You can also download these instructions from here

Annotating alignment groups in YAWAT

ST words should be aligned with the TT words as completely and as compositionally as possible, i.e., try to align every single word and all punctuation marks, but create the smallest possible alignment groups. For example, if the source text says [Killer nurse] and the Spanish TT says [enfermero asesino], then align [killer - asesino] and [nurse - enfermero] and not [killer nurse - asesino enfermero]. Proceed as follows:

1. Select the elements of the alignment group by left-clicking on the ST and TT elements to be aligned. For example, left-click on “killer” and then on “asesino”.
2. To confirm the alignment group, right-click on one of its elements, for example “asesino”. Both elements will then form an alignment group and will be marked in grey.
3. To annotate the group, right-click on one of its elements (for instance, “asesino”). If there is no error, left-click on the default label “Unspecified (no error)”. If there is an error in the aligned group, left-click on the error to be annotated from the 10 errors in the section “error codes”.
4. To annotate an unaligned ST or TT word (mono-label), right-click on the unaligned word and select one of the two possible categories (Addition/Omission and Unintelligible).
5. Once you are done with the annotation of a segment, click on “done” next to the segment number. Changes will be saved.

The automatic segment alignment is sometimes incorrect. If that is the case, please follow the instructions here.

To get a personal Yawat account, please send an email to m.gummiball[at]gmail.com.

Error annotation in YAWAT

While most of the error-based manual translation quality measures assess the quality of the translations, we want to know which ST words are prone to produce which kind of errors in the translation. We therefore suggest error codes that apply to ST-TT alignment groups. Use the YAWAT browser-based tool to mark alignment groups and annotate these groups with an MQM error code.

Error Codes (definitions from MQM):

Accuracy:

1. Mistranslation: The target content does not accurately represent the source content.
2. Addition: The target text includes text not present in the source.
3. Omission: Content is missing from the translation that is present in the source.

Fluency:

1. Word Form: There is a problem in the form of a word, including aspects of agreement, tense-mood.
2. Cohesion: Portions of the text needed to connect it into an understandable whole are missing or incorrect.
3. Word Order: The word order is incorrect.
4. Punctuation: Punctuation is used incorrectly (for the locale or style).
5. Spelling: Issues related to the spelling of words.
6. Unintelligible: The exact nature of the error cannot be determined. Indicates a major breakdown in fluency.

Mono and bi-labels:

YAWAT allows for two sets of error labels, depending on whether the label is to be attached to an alignment group (bi-labels), or to an unaligned ST or TT word (mono-labels).

bi-labels (alignment groups):

- Word Order, Cohesion, Word Form, Punctuation, Spelling, Mistranslation
- Unspecified: is the default label, meaning there is no translation error in the alignment group [default]

mono-labels (unaligned words in the ST or TT):

- Unintelligible
- Addition, Omission: are the default labels for unaligned ST and TT words respectively

Severity:

Four bi-labels for alignment groups (Word Order, Cohesion, Word Form, and Mistranslation) can be assigned a severity label of either minor or critical error. Punctuation and Spelling can only be minor errors.

Precedence:

Any word or alignment group can have exactly one error label. In case an ST-TT alignment group can be annotated with more than one error code, a label should be chosen that matches the most severe error.
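
The label sets, severity constraints, and precedence rule described above can be summarised in a short sketch. The names used here (`BI_LABELS`, `check_bi_label`, `pick_label`) are hypothetical and are not part of YAWAT or the TPR-DB tools. The sketch assumes that "most severe" means critical outranks minor, and that ties keep the first candidate listed, which the text above does not specify.

```python
# Hypothetical helper names; the label sets and severity rules follow the text above.
BI_LABELS = {"Word Order", "Cohesion", "Word Form", "Punctuation",
             "Spelling", "Mistranslation", "Unspecified"}
MONO_LABELS = {"Unintelligible", "Addition", "Omission"}
MINOR_ONLY = {"Punctuation", "Spelling"}      # these can only be minor errors
SEVERITY_RANK = {"minor": 0, "critical": 1}   # assumption: critical outranks minor


def check_bi_label(label, severity=None):
    """Raise ValueError if the label/severity pair is not allowed for an alignment group."""
    if label not in BI_LABELS:
        raise ValueError(f"{label!r} is not a bi-label")
    if label == "Unspecified":
        if severity is not None:
            raise ValueError("'Unspecified' means no error and takes no severity")
        return
    if severity not in SEVERITY_RANK:
        raise ValueError(f"unknown severity {severity!r}")
    if label in MINOR_ONLY and severity != "minor":
        raise ValueError(f"{label!r} can only be a minor error")


def pick_label(candidates):
    """Return the one (label, severity) pair to annotate: the most severe candidate.

    Ties keep the first candidate listed (an assumption, see the lead-in).
    """
    if not candidates:
        return ("Unspecified", None)
    for label, severity in candidates:
        check_bi_label(label, severity)
    # max() returns the first maximal element, so earlier candidates win ties.
    return max(candidates, key=lambda pair: SEVERITY_RANK.get(pair[1], -1))
```

For instance, `pick_label([("Punctuation", "minor"), ("Mistranslation", "critical")])` returns `("Mistranslation", "critical")`, while `check_bi_label("Spelling", "critical")` raises a `ValueError`.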

Error Coding:

Error coding proceeds in two steps. First, mark ST-TT equivalences as alignment groups; the translation should be aligned as exhaustively and as compositionally as possible. Then annotate each alignment group with a bi-label if applicable. For words that cannot be aligned (or attached to an alignment group), use the error category from the mono-labels.

1. Every alignment group is automatically assigned the default label (Unspecified). No further action is required if the alignment group has no error.
2. The entire alignment group is marked with an error even if only one target word in an m-to-n translation relation is erroneous. For instance, assume the English phrase “to encourage” is correctly translated into the German discontinuous expression “um … zu ermutigen”. Many different errors may occur (errors in bold):
  1. Word Form: "to encourage X → um X’ zum ermutigen", if “zum” is a wrong word form (wrong inflection/wrong preposition).
  2. Cohesion: "to encourage X → um X’ ermutigen", if compulsory “zu” is missing in the translation (missing preposition).
  3. Mistranslation: "to encourage X → um X’ zu erfreuen", if “erfreuen” is a wrong word choice (wrong lexeme).
  4. Word Order: "to encourage X → um zu X’ ermutigen", if the translation has a wrong word order. Mark either “X → X’” or “to encourage → um zu ermutigen” depending on which is shorter.
  5. Several errors in one alignment group: "to encourage X → um erfreuen X’", select one error code per alignment group, even if several errors apply (Mistranslation, Cohesion, Word Order), and select the most serious error.
    1. If a mistake concerns only one ST word, the entire alignment group must be labelled with the error to which the erroneous word belongs.

All error codes refer to the TL segment. The alignment group is assigned an error only if the TL segment has an error (or mistake). If the error is severe to an extent that ST-TT correspondences cannot be established, or the text is unintelligible, use a mono-label.
If an error stretches over several alignment groups, each group is marked independently if they contribute equally to the error. For example, assume the two words “life → perpetuas” and “sentences → cadenas” are independently aligned in a reversed word order translation:
Word Order: “four life sentences → cuatro perpetuas cadenas”, if “perpetuas” and “cadenas” are in a wrong target language word order

Code retrieval:

Annotated error codes can be retrieved via the TPR-DB management tool. A feature "Yawat" is generated in the *st and *tt files, which contains the error codes for the annotated words.

Sentence segmentation and sentence alignment

In order to do word alignments within YAWAT, the translations must be split up into an equal number of segments on the source and the target sides. Each segment may consist of one or more sentences (sub-segments). YAWAT can represent m-to-n sentence alignments, and within an aligned segment any conceivable m-to-n word alignment is permitted. However, it is not possible to align words in YAWAT across segment boundaries.

When automatically processing the segment alignments, the TPR-DB management tool assumes that sentences are translated in a one-to-one fashion. The TPR-DB management tool automatically generates sentence-based segment alignments. However, in some cases this might not reflect the situation in the translated texts, and/or the automatic sentence segmentation does not work properly. Segments might be split up at wrong positions or segments have no 1-to-1 correspondence. This section discusses how to adjust this.

Examples of wrong segment alignment

The Figure below (left) shows an example where the English source text is wrongly split after a dot (2 .). Here the wrongly inserted segment boundary should be deleted and the two segments 8 and 9 should be joined into one. The right figure is an example of a translation where two English source sentences (10 and 11) on the left are translated as a single sentence (10) into Chinese. In this case we would like to keep the segment boundary but tell YAWAT to join the two English segments into one, so they can be word-aligned.

A symmetric situation is of course also conceivable, where the target text is wrongly split into non-segments or where a translator produces two sentences for one source sentence. In any case, whereas segmentation should be based on monolingual considerations (i.e. sentence boundaries), the alignment of segments is a cross-lingual consideration and does not need to be one-to-one. To remedy a wrong segment alignment, the alignment information has to be adjusted manually.

Representation and manual correction of alignment information

Alignment information is contained in three files: the source tokens (*.src), the target tokens (*.tgt), and the alignment relations (*.atag) between source and target tokens. Wrong tokenization and sentence segmentation, as well as wrong word and sentence alignment, can be (manually) rectified in these files.

<W cur="462" id="87" segId="9" space=" ">sharp</W>

Example 1: token information in the *.src and *.tgt files indicating the word id and the segment (segId).

Wrong sentence segmentation can be corrected by manually modifying the value of segId in the target (*.tgt) files (or the *.src files), e.g. by replacing segId="9" with segId="8" in Example 1. However, be aware that changing the *.src files might lead to incompatibility when comparing different translations: modifying the word id or the segId in the source files may cause problems during further processing.
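
The segId replacement can also be scripted. The following is a minimal sketch, assuming the token file is well-formed XML whose tokens are `W` elements as in Example 1. The function name `merge_segment`, the root element `Text`, and the first sample token are made up for illustration and are not taken from a real TPR-DB file.

```python
import xml.etree.ElementTree as ET


def merge_segment(xml_text, old_seg, new_seg):
    """Move every <W> token with segId == old_seg into segment new_seg.

    Returns the modified XML as a string and the number of tokens changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for token in root.iter("W"):
        if token.get("segId") == old_seg:
            token.set("segId", new_seg)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Illustrative input: the root element and the first token are invented,
# the second token is the one shown in Example 1.
sample = (
    '<Text>'
    '<W cur="457" id="86" segId="9" space=" ">Load</W>'
    '<W cur="462" id="87" segId="9" space=" ">sharp</W>'
    '</Text>'
)

new_xml, count = merge_segment(sample, "9", "8")
print(count)
print(new_xml)
```

Note that re-serialising with `ElementTree` may change whitespace and other formatting details of the file, so it is worth keeping a backup and, as explained above, applying such changes to the target side only.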

Information concerning the segment alignment of source and target texts is contained in the *.atag files through the salign elements. Any m-to-n sentence alignment is possible. In Example 2, source sentences 10 and 11 are aligned to target sentence 10, the successive src segment 12 is aligned with tgt segment 11, etc.

Example 2: alignment information between the “P22_P3.src” and the “P22_P3.tgt” files

The figure shows the effect of this alignment in YAWAT. Assigning src="0" or tgt="0" indicates that the segment is not aligned.
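
To inspect the segment alignment programmatically, the salign elements can be read with a few lines of Python. This is only a sketch under stated assumptions: it supposes one salign element per src/tgt pair, each with a single segment number in its `src` and `tgt` attributes, so that an m-to-n alignment is spread over several elements. The exact attribute layout of real *.atag files may differ. The root element `alignment`, the function name `read_salign`, and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def read_salign(atag_xml):
    """Group salign pairs by target segment; a value of 0 marks an unaligned segment."""
    root = ET.fromstring(atag_xml)
    by_target = defaultdict(list)
    unaligned_src, unaligned_tgt = [], []
    for element in root.iter("salign"):
        src = int(element.get("src"))
        tgt = int(element.get("tgt"))
        if tgt == 0:
            unaligned_src.append(src)      # source segment without a target
        elif src == 0:
            unaligned_tgt.append(tgt)      # target segment without a source
        else:
            by_target[tgt].append(src)
    return dict(by_target), unaligned_src, unaligned_tgt


# Invented sample mirroring the alignment described in the text:
# src 10 and 11 -> tgt 10, src 12 -> tgt 11, src 13 left unaligned.
sample = (
    '<alignment>'
    '<salign src="10" tgt="10"/>'
    '<salign src="11" tgt="10"/>'
    '<salign src="12" tgt="11"/>'
    '<salign src="13" tgt="0"/>'
    '</alignment>'
)

groups, src_only, tgt_only = read_salign(sample)
print(groups, src_only, tgt_only)
```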

The manually amended alignment files can be uploaded again through the TPR-DB management tool in two different ways:

- As described above (Upload a new study to the TPR-DB) : place all alignment files for the study in a zip folder together with the Translog-II log files. Type in a new study name; the old study can then be deleted
- Upload only the modified *atag and/or *tgt /*src files to overwrite the files of the same name in an existing study (a backup is created on the server)

As the TPR-DB management tool assumes that source texts within one study with the same text identification number are identical, sentence segmentation should only be modified on the target language side, so that the source files (*.src) remain identical for all sessions. On the target side, modifications can be made with respect to word tokenization, sentence segmentation, and segment alignment.

Using Notepad++ for segmentation and segment alignment

Instead of editing the alignment files (*.tgt, *.src and *.atag) manually, it is also possible to convert the information automatically into a textual format, edit the segments in an editor (e.g. Notepad++), and convert it back into the *.src, *.tgt and *.atag files. The following Perl script reads the three alignment files and produces source and target text files. The same script also performs the reverse operation. It can be called, for instance, as follows:

./Atag2Sentences.pl -A P22_P3

The command will generate two files, P22_P3.SourceTok and P22_P3.TargetTok. These files can be loaded into an editor (for instance Notepad++) and visualized in two views (in Notepad++ under View → Move/Clone Current Document → Move to Other View). The figure below shows an example of the paragraph alignment view.

Each line represents a paragraph and successive lines in the two editors represent successive segment alignments. The number of lines in the source and the target editor must thus be identical.

Wrong sentence segmentation can be rectified by inserting or deleting line breaks. For instance, the three wrong segments above

7. ... door seal ;

8. and 2 .

9. Load sharp knives

can be joined in a single segment by joining the lines:

7. ... door seal ;

8. and 2 . Load sharp knives

Segments can also be split up. For instance, in this example, the headline in the beginning of the translation was not automatically detected as an independent segment and was merged with the second segment.

1. 生活成本上升使家庭生活遭受冲击由于食品和燃料价格以 17 年来以最快的速度飙升，英国家庭每年必须多支出 31,300 英镑。

A line break can be inserted to split the paragraph into two:

1. 生活成本上升使家庭生活遭受冲击
2. 由于食品和燃料价格以 17 年来以最快的速度飙升，英国家庭每年必须多支出 31,300 英镑。

Several segments can be joined into one alignment paragraph by inserting a triple slash ‘///’ as for instance in line 10 in the English source window (See Figure).

When inserting a triple slash ‘///’, make sure that it is preceded and followed by white space; otherwise it will be attached to the preceding or following token and recognized as a non-permitted modification of that token.
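
Before converting back, the two constraints mentioned above (identical number of lines in both files, and ‘///’ surrounded by white space) can be checked with a short script. This is a minimal sketch; the function name `check_segment_lines` is hypothetical and the check is not part of the TPR-DB tools.

```python
def check_segment_lines(source_text, target_text):
    """Return a list of problems found in the edited source/target text files."""
    problems = []
    src_lines = source_text.splitlines()
    tgt_lines = target_text.splitlines()

    # Successive lines are successive segment alignments, so the counts must match.
    if len(src_lines) != len(tgt_lines):
        problems.append(
            f"line count differs: {len(src_lines)} source vs {len(tgt_lines)} target"
        )

    # A '///' separator must stand alone, i.e. be delimited by white space.
    for side, lines in (("source", src_lines), ("target", tgt_lines)):
        for number, line in enumerate(lines, start=1):
            for token in line.split():
                if "///" in token and token != "///":
                    problems.append(
                        f"{side} line {number}: '///' attached to token {token!r}"
                    )
    return problems
```

The text of the two files can be read with `open(path, encoding="utf-8").read()` and passed in; an empty list means that neither problem was found.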

By typing the following command, the new segmentation and alignment information can be back-converted to the *src, *tgt and *atag files:

./Atag2Sentences.pl -A P22_P3 -o P22_P3

The parameter -A P22_P3 expects the basename of the *src, *tgt and *atag files; the parameter -o P22_P3 expects the basename of the two text files, P22_P3.SourceTok and P22_P3.TargetTok. The function overwrites the three *src, *tgt and *atag files, but an optional parameter -O <basename> may be used to redirect the output to the three files <basename>.{src,tgt,atag}. As a result of the operations discussed above, P22_P3.atag should now contain the following segment alignment information:

The editor can also be used to adjust wrong word tokenization. Since the TPR-DB tokenizer has very little information, abbreviations like “etc.”, or "2." as above, might be split into two tokens, “etc” and “.”. The dot will then be recognized as a full stop, introducing an erroneous segment boundary. This can be adjusted by 1) deleting the blank space between the two tokens, thereby joining them into one token “etc.”, or 2) only deleting the line break as described above. Method 1) should be applied with caution to *.tgt files, and not at all to *.src files if compatibility and comparability between different translations of the same source are expected. If the tokenization is modified, the back-conversion function will assign new id numbers to the tokens, which might be incompatible with other versions of the same source file.

The editor should not be used to post-edit translations or to change the sequence of characters. Only the addition or deletion of white-space characters (i.e. blank space, line break, tab) is permitted if the re-generated tokenization and alignment files are supposed to remain compatible with the translation process data (i.e. the recorded keystrokes).
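
Whether an edit has only touched white space can be verified by comparing the two versions with all white space removed. The sketch below does this and also ignores stand-alone ‘///’ separators, on the assumption that they are markers rather than part of the text. The function name `only_whitespace_changed` is hypothetical.

```python
def only_whitespace_changed(original, edited):
    """True if the two texts are identical once white space and '///' markers are ignored."""
    def squeeze(text):
        # split() drops blanks, tabs and line breaks; stand-alone '///' tokens are skipped.
        return "".join(token for token in text.split() if token != "///")
    return squeeze(original) == squeeze(edited)


# Joining two lines and two tokens is permitted ...
print(only_whitespace_changed("and 2 .\nLoad sharp knives", "and 2. Load sharp knives"))
# ... changing a character is not.
print(only_whitespace_changed("Load sharp knives", "Load sharp knife"))
```

The first call prints `True` and the second prints `False`.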

The changed files can then be zipped and uploaded to the TPR-DB server. The uploaded alignment files will overwrite the existing ones, as explained above.
