WNGT 2020 Efficiency Shared Task

While the original shared task has finished, we welcome new submissions to make this a rolling evaluation. See the overview paper and the results spreadsheet.


The efficiency task measures machine translation inference performance on CPUs and GPUs with standardized training data and hardware. Participants provide their own code and models. This is the third edition of the efficiency shared task, building on the 2019 and 2018 tasks, which were in turn based on the WAT 2017 Small-NMT Task.

Translation Task

We follow the constrained condition of the WMT 2019 English-German news translation task. We recommend using the WMT 2018 test set for development. Additional data from WMT 2020 is not allowed.

The input and output format is raw text with one sentence per line, separated by UNIX newlines. Participants are responsible for their own tokenization and detokenization, which count toward run time. Unlike previous years, we do not mandate a particular tokenization.
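As a quick illustration, the format can be sanity-checked with standard tools (a sketch, not the official validation; the file name output.de and its contents are stand-ins created here for the demo):

```shell
# Illustrative check that an output file matches the required format:
# one line per input line, UNIX newlines only, plain UTF-8 text.
printf 'ein Satz\nnoch ein Satz\n' > output.de   # stand-in output for the demo
wc -l < output.de                                # must equal the input line count
grep -c "$(printf '\r')" output.de || true       # 0 means no Windows line endings
```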

Hardware

There are two types of hardware: CPU and GPU.

For CPUs, performance will be measured on an AWS c5.metal instance. This machine has 96 hyperthreads on a Cascade Lake architecture and 192 GB RAM. This year, there are two CPU tracks: use the entire machine or use only one core.

For GPUs, performance will be measured on an AWS g4dn.xlarge instance. This machine has one NVIDIA T4 GPU, while the host has 4 vCPUs and 16 GB RAM. The NVIDIA driver will be version 440.44.

Amazon has donated some AWS credits so participants can test on the exact hardware. These are available on a first-come first-served basis and there is some lead time required to get GPUs approved in your account. Contact wngt-info@googlegroups.com.

Measurement

Each system will be run with 1 million lines of raw English input, where each line has at most 100 space-separated words (though your tokenizer will probably break that into more). We will measure the following:

  • Quality on an undisclosed test set scattered through the input, as measured by uncased BLEU.
  • Real time taken by the entire translation command.
  • Peak host RAM consumption.
  • Peak GPU RAM consumption in the GPU track.
  • Size of the model on disk. Docker images should keep all model files in a separate directory (/model, see below) so the size can be measured.
  • Total size of the docker image on disk.

Results will be reported in a table showing all metrics. The presentation will include a series of Pareto frontiers comparing quality with each of the efficiency metrics. We welcome participants optimizing any of the metrics.

Unlike past years, we will not be subtracting loading time from run times. The large input is intended to amortize loading time. Based on last year's results, we anticipate that all submissions will complete in under 2 hours.
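As a rough illustration of how the timing works, wall-clock time covers the entire command, loading included (a sketch, not the official harness; `sleep 1` is a runnable stand-in for the actual translation command):

```shell
# Wall-clock timing of the entire translation command, loading included,
# as the task measures it. `sleep 1` stands in for `sh /run.sh <in> <out>`.
start=$(date +%s)
sleep 1                          # stand-in for the real translation run
end=$(date +%s)
echo "real time: $((end - start)) s"
```

Where GNU time is installed, peak host RAM can be read from the "Maximum resident set size" field of `/usr/bin/time -v` output.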

What is a model?

We will report model size on disk, which means we need to define a model as distinct from code. The model includes everything derived from data: all model parameters, vocabulary files, BPE configuration if applicable, quantization parameters or lookup tables where applicable, and hyperparameters like embedding sizes. You may compress your model using standard tools (gzip, bzip2, xz, etc.) and the compressed size will be reported.
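For example, the compressed size that would be reported can be checked with standard tools (a sketch; model.bin is a hypothetical file created here as a placeholder):

```shell
# Measure a model's size after standard compression, as the rules allow.
# model.bin is a dummy stand-in for a real model file.
head -c 1048576 /dev/zero > model.bin     # 1 MiB placeholder "model"
gzip -9 -c model.bin > model.bin.gz       # gzip is one of the allowed tools
wc -c < model.bin.gz                      # compressed size in bytes is reported
```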

Code can include simple rule-based tokenizer scripts and hard-coded model structure that could plausibly be used for another language pair. If we suspect that your model is hidden in code, we may ask you to provide another model of comparable size for a surprise language pair with reasonable quality.

Baseline Components

This is an open task and participants can use whatever software and models they want subject to the hardware and data constraints.

Using existing constrained WMT 2019 systems, such as Facebook's, is permissible. Microsoft is working to provide the sentence-level components of their English-German submission to WMT 2019 as well. These are not optimized for speed, but might be useful as teachers for a teacher-student setting.

Docker Submission

Competitors should submit a Docker image with all of the software and model files necessary to perform translation.

  • The name of the image should be: wngt2020_<team-name>_<system-name>
  • The image should contain a model directory /model with all the model files as defined above.
  • The image must contain at least a shell script /run.sh (a run.sh file at the root of the filesystem) that executes the actual translation process implemented in the image.
  • /run.sh should take exactly two arguments, in-file and out-file, and be executable as: sh /run.sh <in-file> <out-file>
    • in-file is a text file. Each line of the file (separated by a UNIX newline) contains a plain-text UTF-8 English sentence to be translated.
    • out-file is a text file. Each line of the file (separated by a UNIX newline) contains the plain-text UTF-8 German translation of the corresponding line of the input.
  • Competitors can also add any other directories and files in the image, except any paths starting with /wnmt, which are reserved by the evaluation system.
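A minimal sketch of a conforming /run.sh follows; the `tr` pipeline is only a runnable stand-in, and a real submission would instead invoke its own decoder with files under /model:

```shell
#!/bin/sh
# Sketch of the required interface: sh /run.sh <in-file> <out-file>.
# Upper-casing via `tr` stands in for a real translation engine.
IN="$1"
OUT="$2"
tr '[:lower:]' '[:upper:]' < "$IN" > "$OUT"
```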

Competitors can assume that the system is launched by the following commands:

docker load -i <image-file>

docker run [restriction-options (see below)] [--runtime=nvidia (if using GPUs)] --name=translator -itd wngt2020_<team-name>_<system-name>

docker cp <src> translator:<in-file>

docker exec translator sh /run.sh <in-file> <out-file>

docker cp translator:<out-file> <hyp>

docker rm -f translator

Submission Information

Send a submission mail to wngt-info@googlegroups.com with the following information:

  • Team name (acceptable characters: alphanumeric and hyphen)
  • Name of the primary member of the team.
  • List of entries, each of which is a docker image and a label indicating CPU or GPU.

Teams are encouraged to make multiple entries to cover more of the Pareto frontier or contrast methods.

Timeline

All deadlines are 23:59:59 anywhere on earth (UTC-12).

* April 24, 2020: shared task submissions

* May 5, 2020: system description submissions at https://www.softconf.com/acl2020/wngt/ . Despite what the form says, system descriptions do not need to be anonymous. Teams can be uniquely identified by the number of submissions they made anyway.

* May 25, 2020: camera-ready system descriptions