WNGT 2019 Efficiency Shared Task

This shared task aims to reveal more about the trade-off between translation quality and translation efficiency. This is the second edition of the efficiency shared task; last year's edition was inspired by the small NMT task at the Workshop on Asian Translation.


Efficiency can include a number of concepts:

  • Memory Efficiency: We would like to have small models. Evaluation measures include:
    • Size on disk of the model
    • Number of parameters of the model
    • Size in memory of the full program
  • Computational Efficiency: We would like to have fast models. Evaluation measures include:
    • Time to decode the test set in:
      • A single CPU thread/a single GPU
      • Single sentence decoding/minibatch decoding


TRACKS

The goal of the task is to find systems that are both accurate and efficient. In other words, we want to find systems on the Pareto frontier of efficiency and accuracy. Participants can submit any system that they like, and any system on this Pareto frontier will be of interest. However, we will particularly highlight systems that fall into one of the following two categories:

  • Efficiency track: models that perform at least as well as the baseline compete to be as efficient as possible. The winner is the system that reaches the baseline BLEU score with the highest efficiency, whether in memory or in computation.
  • Accuracy track: models that are at least as efficient as the baseline compete to achieve the highest BLEU score. The winner is the system that improves accuracy the most without any decrease in efficiency.


CORPUS

  • WMT2014 English-German: Preprocessed corpus provided by the Stanford NLP Group.
    • Training: train.{en,de}
    • Validation: newstest2013.{en,de}
    • Test: newstest2014.{en,de}, newstest2015.{en,de}
    • Other data: prohibited


PROCEDURE

Providing the Docker image of the translation system

Competitors should submit a Docker image with all of the software and model files necessary to perform translation.

  • The name of the image should be: wngt2019_<team-name>_<system-name>
  • The image must contain at least a shell script /run.sh (a run.sh file at the root directory) that executes the actual translation process implemented in the image.
  • /run.sh should take exactly two arguments, in-file and out-file, and be executable as: sh /run.sh <in-file> <out-file> (a minimal example is sketched after this list).
    • in-file is a text file. Each line contains the space-separated input tokens to be translated.
    • out-file is a text file. Each line contains the space-separated tokens generated by translating the corresponding line of in-file.
  • Competitors can also add any other directories and files to the image, except under paths starting with /wnmt, which are reserved by the evaluation system.
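
As an illustration, a minimal /run.sh might look like the sketch below. The decoder path, model file, and command-line flags (/opt/translator/decode, model.bin) are hypothetical placeholders; substitute whatever toolkit and model files are actually packaged in your image.

#!/bin/sh
# /run.sh -- entry point called by the evaluation system.
# Usage: sh /run.sh <in-file> <out-file>
IN_FILE="$1"
OUT_FILE="$2"
# Hypothetical decoder invocation; replace with the command of the toolkit
# packaged in the image. It must write one output line per input line.
/opt/translator/decode --model /opt/translator/model.bin \
    --input "$IN_FILE" --output "$OUT_FILE"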


Executing the provided system on Amazon EC2 and gathering runtime metrics

Competitors can assume that the system is launched by the following commands:


docker load -i <image-file>

docker run [restriction-options (see below)] [--runtime=nvidia (if using GPUs)] --name=translator -itd wngt2019_<team-name>_<system-name>

docker cp <src> translator:<in-file>

docker exec translator sh run.sh <in-file> <out-file>

docker cp translator:<out-file> <hyp>

docker rm -f translator


There are some resource constraints on the machines that we will use to run the evaluation process (an illustrative docker run line follows this list):

  • Running with only CPUs:
    • AWS Instance Type: m5.large
    • Number of CPU cores: 1
    • Disk space: 16 GB
    • Host memory: 4 GB
    • Execution time: 1 hour
  • Running with GPUs:
    • AWS Instance Type: p3.2xlarge
    • Number of CPU cores: 1
    • Disk space: 16 GB
    • Host memory: 4 GB
    • Execution time: 1 hour
    • GPU: 1 x NVIDIA Tesla V100
    • NVIDIA Driver Version: 384.111
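
The exact restriction-options applied by the evaluation server are not published here; as a rough local approximation, the limits above can be imitated with standard Docker resource flags. The lines below are an illustrative assumption, not the official settings.

# CPU-only evaluation (approximate local equivalent):
docker run --cpus=1 --memory=4g --name=translator -itd wngt2019_<team-name>_<system-name>
# GPU evaluation:
docker run --cpus=1 --memory=4g --runtime=nvidia --name=translator -itd wngt2019_<team-name>_<system-name>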


METRICS

Our evaluator will gather the following metrics for each competitor's system:

  1. Running time from launching the container to finishing all translation processes.
  2. Peak consumption of the host memory.
  3. Peak consumption of the GPU memory.
  4. Consumption of disk space/size of the Docker image.
  5. MT evaluation metrics of the generated results (BLEU/NIST).

Metrics 1 to 3 are measured using at least two files (see the local timing sketch below):

  • in-file with no lines. This is used to measure the loading overhead of the system.
  • in-file with actual test sentences (in this task, newstest201{4,5}.en).
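
For example, competitors can estimate the loading overhead and the pure translation time locally with the same docker commands as above; the file names below are illustrative.

touch empty.txt
docker cp empty.txt translator:/empty.txt
time docker exec translator sh run.sh /empty.txt /empty.out   # loading overhead only
docker cp newstest2014.en translator:/in.txt
time docker exec translator sh run.sh /in.txt /out.txt        # loading + translation
# The difference between the two timings approximates the pure translation time.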

If the process (each trial of run.sh) does not finish within the specified time limit, the server will kill all running processes and will not record any results for the submitted system.


BASELINE SYSTEMS

We provide three baseline images:

  • wngt2019_organizer_echo
    • Simply sends the input back to the output.
    • This runs extremely fast and consumes very little memory, but also yields the worst BLEU score (usually 0 or a very small value).
  • wngt2019_organizer_nmt-1cpu
    • Runs an attentional encoder-decoder translation system on one CPU.
  • wngt2019_organizer_nmt-1gpu
    • The same as nmt-1cpu, but uses one GPU through CUDA.

Each Docker image can be downloaded from Google Drive.



SUBMISSION INFORMATION

  • Participating in the task
    • Send a submission mail to wngt2019-organizers [at] googlegroups.com with the following information:
      • Team name (acceptable characters: alphanumeric and hyphen).
      • Name of the primary member of the team.
      • System preference:
        • Will use CPU evaluation (YES/NO)
        • Will use GPU evaluation (YES/NO)
  • Submitting the system description paper


IMPORTANT DATES

  • April 8, 2019: Task announcement, data release
  • August 26, 2019: System results due
  • September 2, 2019: System descriptions due
  • September 16, 2019: System description feedback provided
  • September 30, 2019: Camera-ready system descriptions due
  • November 4, 2019: Presentation at the workshop


RESULTS


CONTACT INFORMATION

If you have any questions, please contact wngt2019-organizers@googlegroups.com