Low Resource MT

Task Description

The Low Resource Translation Task addresses a conventional bilingual text translation task in the domain of TED talks. Given the difficulty of the proposed translation direction and the scarcity of available parallel data, we provide and support the use of additional parallel data from related languages.

  • Language direction:

    • Bilingual Task: Basque -> English (eu-en)

  • In-Domain training and development data:

    • An archive with TED talks (training/dev sets for the Basque-English and additional talks for Basque-French, Basque-Spanish, Spanish-French, Spanish-English, French-English) can be downloaded from the WIT3 website

    • An additional archive with the original xml files of all the 2018 TED talks (excluding those in the tst2018 test set) can be downloaded from the WIT3 website

    • The list of talkid in the tst2018 evaluation set can be downloaded from the WIT3 website

    • Any data of the original TED talks can be downloaded from the TED website (excluding the talks in the tst2018 test set)

  • Out-of-Domain training data:

    • Parallel/monolingual corpora (including Basque data) provided by OPUS can be downloaded from the OPUS website

    • Parallel/monolingual corpora provided by WMT can be downloaded from the WMT websites

    • Basque-Spanish parallel and monolingual data from the Open Data Euskadi Repository (Thanks to Vicomtech.org for preparing the data!)

  • Evaluation data:

    • Input format: NIST XML format, case-sensitive source text with punctuation

    • Output format: NIST XML format, detokenized case-sensitive translations with punctuation. NIST XML format is described in this paper (Section 8 and Appendix A); XML templates will be made available; meanwhile, you can refer to the XML templates of the 2016 edition

    • Submission: please refer to the submission guidelines provided below

    • Text encoding: UTF8

    • Tst2018 will include TED talks in Basque that have to be translated into English; it can be downloaded from the WIT3 website

  • Evaluation process:

    • Case-sensitive BLEU and NIST scores are computed with the NIST script mteval-v13a.pl, while the case-sensitive TER score is computed with tercom.7.25.jar. The respective invocations are:

      • mteval-v13a.pl -c

      • java -Dfile.encoding=UTF8 -jar tercom.7.25.jar -N -s

    • The internal tokenization of the two scorers is used

  • Evaluation Server:

An online Evaluation Server is available to score systems on development sets. After the evaluation period, the server will score evaluation sets as well. Participants interested in using the server are kindly asked to contact cattoniATfbkDOTeu

Submission Guidelines

Each participant has to submit at least one run for each translation task they registered for.

Detokenized case-sensitive automatic translations with punctuation have to be wrapped in NIST XML formatted files. NIST XML format is described in this paper (Section 8 and Appendix A); XML templates will be made available; meanwhile, you can refer to the XML templates of the 2016 edition
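As an illustration, the Python sketch below wraps one-sentence-per-line translations into a minimal NIST XML tstset. The element and attribute names follow common mteval conventions, and the setid, docid, sysid, and genre values are placeholders; the exact structure required should be checked against the official XML templates.

```python
# Sketch: wrap detokenized translations (one per line) in a minimal
# NIST XML tstset. Element/attribute names follow common mteval
# conventions; verify against the official XML templates before use.
from xml.sax.saxutils import escape

def wrap_tstset(lines, setid="IWSLT18.tst2018", srclang="eu",
                trglang="en", sysid="my-system", docid="talk-0001"):
    # Escape &, <, > in segment text and number segments from 1.
    segs = "\n".join(
        f'<seg id="{i}">{escape(line.strip())}</seg>'
        for i, line in enumerate(lines, start=1)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<mteval>\n'
        f'<tstset setid="{setid}" srclang="{srclang}" '
        f'trglang="{trglang}" sysid="{sysid}">\n'
        f'<doc docid="{docid}" genre="lectures">\n'
        f'{segs}\n'
        f'</doc>\n</tstset>\n</mteval>'
    )

xml_doc = wrap_tstset(["Hello, world.", "This is a test."])
```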

XML files with runs have to be submitted as a gzipped TAR archive (in the format specified below) and e-mailed to cattoniATfbkDOTeu

TAR archive file structure:

<UserID>/<Set>.<Task>.<UserID>.primary.xml

/<Set>.<Task>.<UserID>.contrastive1.xml

/<Set>.<Task>.<UserID>.contrastive2.xml

/...

where:

<UserID> = user ID (short name) of participant provided in the Registration Form

<Set> = IWSLT18.tst2018

<Task> = bilingual_<fromLID>-<toLID>

<fromLID>, <toLID> = Language identifiers (LIDs) as given by ISO 639-1 codes; see here for examples of language codes.

The PRIMARY run for the Bilingual Task will be used for the official scoring; nevertheless, CONTRASTIVE runs will be evaluated as well. Different runs can be included in the same archive.

Example:

fbk/IWSLT18.tst2018.bilingual_eu-en.fbk.primary.xml

/IWSLT18.tst2018.bilingual_eu-en.fbk.contrastive1.xml

/IWSLT18.tst2018.bilingual_eu-en.fbk.contrastive2.xml
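The archive layout above can be produced, for instance, with a small Python helper; the user ID and file names here are placeholders taken from the example.

```python
# Sketch: pack XML run files into a gzipped TAR archive using the
# <UserID>/<filename> layout required for submission. The user ID
# and file names are placeholders from the example above.
import tarfile
from pathlib import Path

def build_submission(user_id, run_files, out_path):
    """Add each run file as <user_id>/<filename> inside a .tar.gz."""
    with tarfile.open(out_path, "w:gz") as tar:
        for run in map(Path, run_files):
            tar.add(run, arcname=f"{user_id}/{run.name}")

# Usage (assumes the run files already exist on disk):
# build_submission(
#     "fbk",
#     ["IWSLT18.tst2018.bilingual_eu-en.fbk.primary.xml",
#      "IWSLT18.tst2018.bilingual_eu-en.fbk.contrastive1.xml"],
#     "fbk_submission.tar.gz",
# )
```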

Re-submitting runs is allowed as long as the emails arrive BEFORE the submission deadline. If multiple TAR archives are submitted by the same participant, only the runs of the most recent submission email will be used for the IWSLT 2018 evaluation; previous emails will be ignored.

Schedule

Data available: June, 2018

Test data available: July, 2018

Translation submission deadline: August 31st, 2018

Task coordinators

Roldano Cattoni

Mauro Cettolo

Marcello Federico