First Workshop on Simple and Efficient Natural Language Processing
Shared Task: Call for Submissions
SustaiNLP 2020 (co-located with EMNLP 2020) is organizing a shared task to promote the development of effective, energy-efficient models for difficult NLU tasks. This shared task is centered around the SuperGLUE benchmark, which tests a system’s performance across a diverse set of eight NLU tasks. In addition to the standard SuperGLUE performance metrics, the shared task will evaluate the energy consumption of each submission while processing the test data.
Fairly evaluating training efficiency in the most general setting is challenging. Nearly all current approaches to NLU tasks start training from pretrained models and embeddings, but it is not obvious how to account for the energy already spent on these pretrained elements. On the other hand, forbidding participants from using these components would likely render the task less useful and raise the barrier for participation. Moreover, as large-scale pretrained models reach production, the cumulative lifetime environmental cost of the models will be dominated by the computational cost of inference, which calls for particular attention to this stage. As a consequence, we decided to focus on inference for this iteration of the shared task.
We will evaluate energy efficiency via energy consumption, as measured by the experiment-impact-tracker library (Henderson et al. 2020). Several methods for measuring computational efficiency have been proposed, ranging from FLOPS to inference time. Recent work advocates tracking energy consumption as a metric that allows for comparing the performance of different architectures in the fairest way.
Participants should provide their own model code and trained models, though the use of existing libraries and pretrained models is allowed. A sample script for training models on SuperGLUE is available here and can be used as a starting point. In addition to the rules listed here, participants should abide by the rules of participation of the SuperGLUE benchmark (see question 2 here). The file format will follow that of the official distribution.
Submissions will be run on standardized machines with access to the following:
CPUs: two Cascade Lake processors with 192 GB RAM.
GPUs: four Nvidia V100 GPUs (32 GB), Nvidia driver version 430.14, CUDA version 10.2.
We are not using Docker due to issues measuring energy usage from a container.
Submissions will be evaluated based on the overall SuperGLUE score (except WSC, see below) and the total energy consumed from CPU, DRAM, and GPU during inference as detailed in equation 1 of Henderson et al. 2020. To more accurately estimate energy consumption, each submission will be run 5 times and the average performance and energy consumption will be used for scoring.
Task performance will be measured using the official SuperGLUE evaluation server. Therefore, submissions should produce prediction files adhering to the SuperGLUE format. As obtaining nontrivial performance on WSC typically requires a large amount of task-specific engineering, WSC will not be used for the shared task. Submissions do not need to produce predictions for its test set and the score will not be included in the overall performance metric. Similarly, submissions do not need to produce predictions for the diagnostic datasets.
Energy consumption will be measured by the experiment-impact-tracker library. A submission should provide an inference script that will produce predictions for the test set. When evaluating a submission, we will use a Python wrapper that invokes this tracking library (`tracker.launch_impact_monitor()`) just before starting the inference script. We will use the “total_power” metric measured by the library.
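As a rough illustration of the wrapper described above, the following sketch starts the tracker and then runs an inference script. It assumes the `ImpactTracker` class and its `launch_impact_monitor()` method as shown in the experiment-impact-tracker README. The function names, the `tracker_factory` parameter, and the choice to run the script as a subprocess are hypothetical and do not describe the organizers' actual wrapper.

```python
import subprocess
import sys


def default_tracker_factory(log_dir):
    # Imported lazily so the rest of the sketch can be exercised
    # on a machine where the tracking library is not installed.
    from experiment_impact_tracker.compute_tracker import ImpactTracker
    return ImpactTracker(log_dir)


def run_with_tracking(command, log_dir, tracker_factory=default_tracker_factory):
    """Start energy tracking, run the inference command, return its exit code."""
    tracker = tracker_factory(log_dir)
    # Start monitoring just before the inference script begins.
    tracker.launch_impact_monitor()
    completed = subprocess.run(command)
    return completed.returncode
```

A call might look like `run_with_tracking([sys.executable, "run_inference.py"], "impact_logs")`, where both the script name and the log directory are placeholders.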
A submission’s inference script can either produce predictions for all tasks at once or one specified task at a time, in which case the energy consumption to produce predictions for all tasks will be summed (and then averaged across runs). The inference script should only assume access to raw data as downloaded from the SuperGLUE site, and not rely on cached preprocessed data.
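The aggregation described above (sum over tasks within a run, then average across runs) can be sketched as follows. The readings are made-up numbers in arbitrary units, and the function names are illustrative only.

```python
from statistics import mean


def total_energy_per_run(task_energy):
    """Sum the per-task energy readings of a single run."""
    return sum(task_energy.values())


def average_energy(runs):
    """Average the per-run totals across repeated runs."""
    return mean(total_energy_per_run(run) for run in runs)


# Hypothetical readings for two tasks over five runs (arbitrary units).
runs = [
    {"BoolQ": 1.0, "RTE": 0.5},
    {"BoolQ": 1.25, "RTE": 0.25},
    {"BoolQ": 1.5, "RTE": 0.5},
    {"BoolQ": 0.75, "RTE": 0.25},
    {"BoolQ": 1.0, "RTE": 0.5},
]
print(average_energy(runs))  # prints 1.5
```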
To rank submissions, we will first group submissions into tracks by SuperGLUE score. There will be three tracks, defined as follows:
Track 1: performance greater than that of bert-base-uncased on all tasks, but less than that of roberta-large on all tasks
Track 2: performance greater than that of roberta-large on all tasks
Track 3 (CPU only): performance greater than that of bert-base-uncased on all tasks (no maximum score), but submissions will be restricted to only use CPU
Submissions will be assigned to the track based on the hardware used and the lowest threshold met (e.g. if a submission’s performance on one task is below that of roberta-large, but performance on the other tasks and the average performance are above it, it will still be considered in Track 1). For tasks with multiple evaluation metrics (e.g. EM and F1), the unweighted average of the metrics will be used to determine performance on that task. Exact minimum performance thresholds are as follows:
Within each track, submissions will be ranked by lowest energy consumption. We encourage participants to make submissions to each track, though participants will be limited to six submissions total.
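The track assignment and ranking rules can be sketched as below. This is one reading of the rules, not the organizers' scoring code: it treats "greater than" as a strict comparison and places every qualifying CPU-only submission in Track 3. The threshold values are made-up placeholders, not the official thresholds, and all function names are illustrative.

```python
def task_score(metrics):
    """Unweighted average of a task's metrics (e.g. EM and F1)."""
    return sum(metrics) / len(metrics)


def beats_on_all_tasks(scores, baseline):
    """True if the submission is strictly above the baseline on every task."""
    return all(scores[task] > baseline[task] for task in baseline)


def assign_track(scores, bert_base, roberta_large, cpu_only=False):
    """Return 1, 2, or 3, or None if the bert-base-uncased threshold is missed."""
    if not beats_on_all_tasks(scores, bert_base):
        return None
    if cpu_only:
        return 3
    # Lowest threshold met: one task below roberta-large means Track 1.
    return 2 if beats_on_all_tasks(scores, roberta_large) else 1


def rank_within_tracks(submissions):
    """Group (name, track, energy) tuples by track, lowest energy first."""
    ranked = {}
    for name, track, energy in sorted(submissions, key=lambda s: s[2]):
        if track is not None:
            ranked.setdefault(track, []).append(name)
    return ranked


# Placeholder thresholds for two tasks only (NOT the official values).
bert_base = {"BoolQ": 70.0, "MultiRC": 40.0}
roberta_large = {"BoolQ": 85.0, "MultiRC": 50.0}

# MultiRC has two metrics, so the task score is their unweighted average.
scores = {"BoolQ": 88.0, "MultiRC": task_score([30.0, 60.0])}
print(assign_track(scores, bert_base, roberta_large))  # prints 1
```

In this example the submission exceeds the placeholder roberta-large threshold on BoolQ but not on MultiRC, so it lands in Track 1.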
To submit an entry, prepare a .zip file or create a Git repo with the model and code. Participants are free to use any system and framework as long as it can be provided with a Python wrapper adequate to run the measurement library.
The submission should have a top-level README detailing the team name, submission name (to differentiate multiple submissions from a team), how and where to download any pretrained weights, and a script to perform inference on the SuperGLUE test set. Submissions should be emailed to firstname.lastname@example.org with the following information:
Name of the primary contact of the team.
Attachment of or link to the submission as detailed above.
The organizing committee of the workshop reserves the right to remove any submissions that the OC as a whole concludes are trying to unfairly manipulate the metric, such as by caching precomputed representations for the public test data.
Shared task submissions due: September 11, 2020 (extended from August 28, 2020)
System descriptions due: September 11, 2020
Camera-ready system description: October 10, 2020
Workshop: November 11, 2020
All deadlines are 11:59 pm UTC-12 (“anywhere on Earth”).
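For participants in other time zones, the following small sketch converts an "anywhere on Earth" deadline to UTC using only the Python standard library. The helper name is illustrative, and the date used is the September 11, 2020 deadline listed above.

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" corresponds to a fixed offset of UTC-12.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_in_utc(year, month, day):
    """Return 11:59 pm anywhere-on-Earth on the given date, expressed in UTC."""
    local = datetime(year, month, day, 23, 59, tzinfo=AOE)
    return local.astimezone(timezone.utc)


print(aoe_deadline_in_utc(2020, 9, 11).isoformat())
# prints 2020-09-12T11:59:00+00:00
```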