Shared Task

Update! (5/19): Official evaluation results are published!

Update! (4/14): Evaluation metrics of baseline systems are added.

Update! (4/10): Amazon has graciously offered free computation time to sponsor the evaluation, and has also prepared a system/compute setup for interested participants.

Update! (3/27): We have changed the AWS instance type for CPU evaluation from T2 to M5.

Update! (3/21): We have changed the GPU type from K80 to V100.

Basic Idea

The basic idea of this task (inspired by the small NMT task at the Workshop on Asian Translation) is that for NMT, not only accuracy but also test-time efficiency is important.

Efficiency can include a number of concepts:

  • Memory Efficiency: We would like to have small models. Evaluation measures include:
    • Size on disk of the model
    • Number of parameters of the model
    • Size in memory of the full program
  • Computational Efficiency: We would like to have fast models. Evaluation measures include:
    • Time to decode the test set in a single CPU thread
    • Time to decode the test set on a single GPU

Tracks

The goal of the task will be to find systems that are both accurate and efficient. In other words, we want to find systems on the Pareto frontier of efficiency and accuracy. Participants can submit any system that they like, and any system on this Pareto frontier will be considered advantageous. However, we will particularly highlight systems that fall into one of two categories:

  • Efficiency track: Models that perform at least as well as the baseline compete to be the most efficient. Here, the winner will be the system that achieves at least the baseline BLEU score with the highest efficiency, memory or computational.
  • Accuracy track: Models that are at least as efficient as the baseline compete to achieve the highest BLEU score. Here, the winner will be the system that improves accuracy the most without a decrease in efficiency.

Corpus

  • WMT2014 English-German: Preprocessed corpus provided from the Stanford NLP Group.
    • Training: train.{en,de}
    • Validation: newstest2013.{en,de}
    • Test: newstest2014.{en,de}, newstest2015.{en,de}
    • Other data: prohibited

Procedure

Providing the Docker image of the translator system

Competitors should submit a Docker image with all of the software and model files necessary to perform translation.

  • The name of the image should be: wnmt2018_<team-name>_<system-name>
  • The image must contain at least a shell script /run.sh (a run.sh file at the root directory), which executes the actual translation process implemented in the image.
  • /run.sh should take exactly two arguments, in-file and out-file, and be executable as: sh /run.sh <in-file> <out-file>
    • in-file is a text file. Each line contains space-separated input tokens to be translated.
    • out-file is a text file. Each line contains space-separated tokens generated by translating the corresponding line of in-file.
  • Competitors can also add any other directories and files to the image, except paths starting with /wnmt, which are reserved by the evaluation system.
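To make the calling convention concrete, here is a minimal sketch of a conforming run.sh. The "translation" step is a stand-in that simply echoes each line of tokens back (like the echo baseline below); a real system would load its model and decode here.

```shell
# Minimal sketch of a conforming run.sh (hypothetical; the actual
# translation command depends on the submitted system).
cat > run.sh <<'EOF'
#!/bin/sh
# Usage: sh run.sh <in-file> <out-file>
IN="$1"; OUT="$2"
# Stand-in translation: copy each line of space-separated tokens through.
# A real system would load its model here and decode $IN into $OUT.
while IFS= read -r line; do
  printf '%s\n' "$line"
done < "$IN" > "$OUT"
EOF

# Demonstrate the calling convention used by the evaluator:
printf 'das ist ein test\nhallo welt\n' > in.txt
sh run.sh in.txt out.txt
cat out.txt
```

The script must write exactly one output line per input line, since BLEU is computed line-by-line against the reference.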

Executing the provided system on Amazon EC2 and gathering runtime metrics

Competitors can assume that the system is launched by the following commands:

docker load -i <image-file>

docker run [restriction-options (see below)] [--runtime=nvidia (if using GPUs)] --name=translator -itd wnmt2018_<team-name>_<system-name>

docker cp <src> translator:<in-file>

docker exec translator sh /run.sh <in-file> <out-file>

docker cp translator:<out-file> <hyp>

docker rm -f translator

There are some resource constraints on the machines that we will use to run the evaluating process:

  • Running with only CPUs:
    • AWS Instance Type: m5.large
    • Number of CPU cores: 1
    • Disk space: 16 GB
    • Host memory: 4 GB
    • Execution time: 1 hour
  • Running with GPUs:
    • AWS Instance Type: p3.2xlarge
    • Number of CPU cores: 1
    • Disk space: 16 GB
    • Host memory: 4 GB
    • Execution time: 1 hour
    • GPU: 1 x NVIDIA Tesla V100
    • NVIDIA Driver Version: 384.111
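The exact restriction-options passed to docker run are not spelled out above; the following is a plausible mapping of the stated limits (1 CPU core, 4 GB host memory) to standard docker flags. This is our assumption for illustration, not the official option list.

```shell
# Hypothetical restriction-options matching the stated constraints
# (--cpus and --memory are standard docker run flags; the evaluator's
# actual flags may differ).
RESTRICT="--cpus=1 --memory=4g"
echo "docker run $RESTRICT --name=translator -itd wnmt2018_<team-name>_<system-name>"
```

Testing your image locally under the same limits is a good way to avoid out-of-memory failures on the evaluation server.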

Metrics

Our evaluator will gather the following metrics for each competitor's system:

  1. Running time from launching the container to finishing all translation processes.
  2. Peak consumption of host memory.
  3. Peak consumption of GPU memory.
  4. Consumption of disk space / size of the Docker image.
  5. MT evaluation metrics of the generated results (BLEU/NIST).

Metrics 1 to 3 are measured using at least two input files:

  • An in-file with no lines. This is used to measure the loading overhead of the system.
  • An in-file with the actual test sentences (in this task, newstest201{4,5}.en).

If a process (any trial of run.sh) does not finish within the specified time limit, the server will kill all running processes and will not record any results for the submitted system.
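The two-file measurement above can be sketched as follows. Here run.sh is a trivial stand-in and the filenames are illustrative; with a real system, the empty-file run captures model-loading overhead only, and the difference between the two runs approximates pure decoding time.

```shell
# Sketch of the two-run timing scheme (stand-in run.sh; a real system
# loads its model on every invocation, which the empty-file run measures).
cat > run.sh <<'EOF'
#!/bin/sh
cp "$1" "$2"   # stand-in: a real system loads its model, then decodes
EOF

: > empty.txt                          # in-file with no lines
printf 'ein satz .\n' > newstest.txt   # in-file with test sentences

t0=$(date +%s); sh run.sh empty.txt out_empty.txt
t1=$(date +%s); sh run.sh newstest.txt out_full.txt
t2=$(date +%s)
echo "loading overhead: $((t1 - t0))s, full run: $((t2 - t1))s"
```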

Baseline Systems

We provide three baseline images:

  • wnmt2018_organizer_echo
    • Simply sends the input back to the output.
    • Extremely fast and uses very little memory, but yields the worst BLEU score (usually 0 or a very small value).
  • wnmt2018_organizer_nmt-1cpu
    • An attentional encoder-decoder translation system running on a single CPU.
  • wnmt2018_organizer_nmt-1gpu
    • Same as nmt-1cpu, but uses one GPU through CUDA.

Each Docker image can be downloaded from Google Drive.

The following table shows the expected metrics of the baseline systems on the evaluation server (see the published results for the official metrics):

  • newstest2014:
  • newstest2015:

(BLEU scores are calculated without any postprocessing.)

In addition, Amazon has released a Sockeye system that competitors can use for the WNMT18 shared task, with step-by-step directions and a link to download a pre-trained model, which can be found here.

Computation Credits

Amazon has kindly donated computation credits for teams that wish to test their models on AWS. Any team that plans to participate in the shared task should contact the shared task organizers, and we will be happy to give you credits.

Also, if you are interested in training systems using Sockeye but don't have the resources to do so on your own, larger portions of free credits are available for a limited number of teams. Again, get in touch with the task organizers and we will help you out.

Submission Information

  • Participating in the task
    • Send a submission mail to wnmt2018-shared-task [at] googlegroups.com with the following information:
      • Team name (acceptable characters: alphanumeric and hyphen).
      • Name of the primary member of the team.
      • System preference:
        • Will use CPU evaluation (YES/NO)
        • Will use GPU evaluation (YES/NO)
  • Submitting the system description paper

Results

We received 13 systems (6 for CPU and 7 for GPU) from 4 teams. All metrics we gathered are listed in a Google Spreadsheet.

Contact Information

If you have any questions, please contact wnmt2018-shared-task [at] googlegroups.com

Important Dates

  • Jan 20, 2018: Task announcement
  • Mar 9, 2018: Baseline release
  • May 14, 2018: System submission deadline
  • May 18, 2018: Final result announcement/draft papers due
  • May 22, 2018: Review feedback of system description papers
  • May 28, 2018: Camera-ready papers due
  • Jul 20, 2018: Workshop at ACL in Melbourne!