Speech Translation

Task Description

The Speech Translation Task addresses the translation of English audio into German and Portuguese text. In contrast to last year, this year's evaluation campaign contains two different test sets: the traditional TED/lecture test set and the How2 video test set:

  • End-to-End Evaluation: This year, the individual models of the traditional pipeline will not be evaluated separately; only the end-to-end performance will be measured. Every participant has to generate German or Portuguese translations from the English audio, using either a traditional pipeline of separate components or an end-to-end model. For participants who want to focus on one component of the pipeline, we provide baseline components for the other parts. If participants wish, the English transcript in CTM format will also be evaluated (an example CTM line is shown after this list).
  • Multi-modal translation: The How2 data set also contains the video, which can be used as additional input alongside the audio. For training, this is only available for Portuguese, but participants are free to exploit it for the English-to-German direction as well (e.g. via multi-lingual models). Videos will be provided for the How2 test sets as well as for the TED test set.
  • Baseline model: We provide a baseline implementation of the traditional pipeline as a Docker container.
  • End-to-End Models: This evaluation should be used to compare end-to-end speech translation models with the traditional pipeline approach. Therefore, end-to-end models will be evaluated under a special evaluation condition. Furthermore, we provide an aligned TED corpus of English audio and German text.
  • Multi-lingual adaptation: For the How2 test set, in-domain training data is only available for English-Portuguese; for English-German, only out-of-domain training data is available.
  • Flexible evaluation schedule: The test data will be available for several months, enabling a flexible evaluation schedule for each participant.
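
The CTM format referred to above is the standard NIST time-marked conversation format: one word per line, together with the recording it belongs to, its channel, start time and duration in seconds, and an optional confidence score. A minimal, purely illustrative example (file name, times, words and confidences are made up):

;; <file> <channel> <start-time> <duration> <word> [<confidence>]
talk_0001 1 17.42 0.32 hello 0.98
talk_0001 1 17.74 0.21 world 0.95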

Evaluation Conditions

  • Please indicate whether your submission used:
    • End-to-End Model: We use the following definition of an end-to-end model:
      • No intermediate discrete representations (neither source-language representations as in a cascade nor target-language representations as in ROVER-style combination)
      • All parameters/parts that are used during decoding need to be trained on the end-to-end task (they may also be trained on other tasks -> multi-task training is ok, LM rescoring is not ok)
    • Multimodal information

Allowed Training Data

Development and Evaluation Data for TED

  • The development and evaluation data is not segmented using the reference transcript. The archives contain a segmentation into sentence-like segments produced by automatic tools, but participants may also use a different segmentation. The data is provided as an archive with the following files ($set being e.g. IWSLT.TED.dev2010):
    • $set.en-de.en.xml: Reference transcript (will not be provided for evaluation data)
    • $set.en-de.de.xml: Reference translation (will not be provided for evaluation data)
    • CTM_LIST: Ordered file list containing the ASR output CTM files (will not be provided for evaluation data; generated by ASR systems that use additional data)
    • FILE_ORDER: Ordered file list containing the wav files
    • $set.yaml: This file contains the time steps of the sentence-like segments. It is generated by the LIUM Speaker Diarization tool.
    • $set.h5: This file contains the 40-dimensional filterbank features for each sentence-like segment of the test data, created by XNMT. (A minimal sketch for loading these two files follows this list.)
    • The last two files are created by the following command:
  • Development data:
  • Evaluation data:
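
To give an idea of how the segmentation and feature files fit together, here is a minimal Python sketch for inspecting them. It assumes PyYAML and h5py are installed; the exact keys inside the YAML entries and the per-segment layout of the HDF5 file are assumptions, not guaranteed by this description.

# Inspect the sentence-like segmentation and the filterbank features
# for one set (set name taken from the example above).
import yaml
import h5py

set_name = "IWSLT.TED.dev2010"

# Time steps of the sentence-like segments (LIUM Speaker Diarization output);
# each entry is assumed to be a dict with offset/duration and the source wav.
with open(set_name + ".yaml") as f:
    segments = yaml.safe_load(f)
for seg in segments[:3]:
    print(seg)

# 40-dimensional filterbank features per segment (created by XNMT);
# each dataset is assumed to have shape (num_frames, 40).
with h5py.File(set_name + ".h5", "r") as feats:
    for key in list(feats.keys())[:3]:
        print(key, feats[key].shape)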

Development and Evaluation Data for How2

  • The official How2 evaluation data is segmented using the reference transcript and aligned with the video.
  • Development data:
    • English-Portuguese: Development data is part of the How2 corpus
    • German translation: Available here
  • Evaluation data:

Submission Guidelines

  • Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run for each track. All other run submissions are treated as CONTRASTIVE runs. If none of the runs is marked as PRIMARY, the latest submission (according to the file time-stamp) for the respective track will be used as the PRIMARY run.
  • Runs have to be packaged as a gzipped TAR archive (see format below) and sent as an email attachment to jan.niehues@kit.edu and sebastian.stueker@kit.edu.
  • Each run has to be stored as an SGML file or as a plain text file with one sentence per line.
  • Scoring will be case-sensitive and will include punctuation. Submissions have to be in UTF-8.

TAR archive file structure:

<UserID>/<Set>.<Task>.<UserID>.primary.xml

<UserID>/<Set>.<Task>.<UserID>.contrastive1.xml

<UserID>/<Set>.<Task>.<UserID>.contrastive2.xml

...

where:

<UserID> = user ID of the participant used to download the data files

<Set> = IWSLT18.SLT.tst2018

<Task> = <fromLID>-<toLID>

<fromLID>, <toLID> = language identifiers (LIDs) as given by ISO 639-1 codes; see for example the WIT3 webpage
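
Putting the naming scheme together, the following minimal Python sketch packages two runs into a correctly named gzipped TAR archive. The user ID, language pair and local file paths are placeholders and have to be replaced with your own values.

# Package PRIMARY and CONTRASTIVE runs into the required archive layout.
import tarfile

user_id = "myuserid"              # placeholder: your download user ID
set_name = "IWSLT18.SLT.tst2018"
task = "en-de"                    # <fromLID>-<toLID> as ISO 639-1 codes

# Placeholder local paths of the run files to submit.
runs = {
    "primary": "runs/primary.de.xml",
    "contrastive1": "runs/contrastive1.de.xml",
}

with tarfile.open(user_id + ".tgz", "w:gz") as tar:
    for run_type, local_path in runs.items():
        arcname = f"{user_id}/{set_name}.{task}.{user_id}.{run_type}.xml"
        tar.add(local_path, arcname=arcname)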