The WNGT 2020 DGT shared task on "Document-Level Generation and Translation" considers generating textual documents from structured data, from documents in another language, or from both.
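Depending on the input/output pair, we will compare three types of systems in the shared task: systems that generate from structured data, systems that translate documents from another language, and systems that use both inputs.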
In addition, the data listed below is English-German, so there are two target languages: English and German.
This results in a total of 6 tracks. Participants can choose to work on a subset of the tracks or on all of them.
Diagram of different tracks.
We re-use the RotoWire English-German dataset from the DGT task at WNGT 2019. This is a subset of the RotoWire dataset [1] with professional German translations. Each instance in this dataset has three components: the box score of an NBA game, the game summary in English, and its German translation by domain experts. Please see here for more details. The downloaded dataset format is described here.
Participants can further utilize the following resources for the respective tracks. Systems that use data resources other than those listed will be marked as unconstrained and will be compared separately, for a fairer comparison.
There will be a baseline system as described in [2]. Systems will be evaluated on the test split of the RotoWire English-German dataset according to the following metrics:
Content accuracy evaluation is performed using this tool. Suggestions for evaluation methods are also very welcome.
Helper tools can be downloaded here.
For all tracks, we ask participants to save the generated results in a format similar to the original dataset. Specifically, a submission file should be a single JSON file that contains a list of records with the following fields:
id: ID of each document.
summary: Word-tokenized generated summary.
For example, a valid submission file would look like the one below:
[
{"id": "02_24_16-Cavaliers-Hornets-TheEasternConference-leadingClevelandCavaliers",
"summary": ["Die", "in", "der", ...]},
{"id": "01_01_16-Knicks-Bulls-TheChicagoBulls(19",
"summary": ["Die", "Chicago", "Bulls", ...]},
{"id": "11_07_16-Pelicans-Warriors-AnthonyDaviscontinuestobe",
"summary": ["Anthony", "Davis", "ist", ...]},
{"id": "04_01_16-Cavaliers-Hawks-Inwhatwasahistoric",
"summary": ["In", "einer", "historischen", ...]},
{"id": "01_07_17-Thunder-Nuggets-RussellWestbrookrecordedyetanother",
"summary": ["Russell", "Westbrook", "verzeichnete", ...]},
...
]
Note that the submission file does not need to be indented like the example above.
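As a rough illustration of the format described above, the following Python sketch writes a list of records with the id and summary fields to a single JSON file. The function name write_submission, the output file name, and the example document ID are hypothetical and are not part of the provided helper tools.

```python
import json


def write_submission(summaries, path):
    """Write a {document ID: token list} mapping as a JSON list of records.

    `summaries` is assumed to hold already word-tokenized summaries.
    """
    records = [
        {"id": doc_id, "summary": list(tokens)}
        for doc_id, tokens in summaries.items()
    ]
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps German characters readable in the file;
        # no indentation is requested, since it is not required.
        json.dump(records, f, ensure_ascii=False)
    return records


if __name__ == "__main__":
    # Hypothetical document ID and tokens, for illustration only.
    write_submission(
        {"example-doc-1": ["Die", "Bulls", "gewannen", "."]},
        "submission.json",
    )
```

Running the provided validator on the resulting file remains the reliable way to confirm that it is accepted.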
For the MT track, we provide a script that converts sentence-by-sentence plaintext outputs into the specified format. Download the helper tools and run the script as follows:
$ python plain2json.py --source-dir /path/to/translations --target-json output.json
where each file in the /path/to/translations directory should have one target-language sentence per line.
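To make the expected input layout concrete, here is a minimal sketch of this kind of conversion. It is not the provided plain2json.py, whose behavior may differ. It assumes that each file name is the document ID and that tokens are separated by whitespace; both are assumptions made for illustration, and plain_to_records is a hypothetical name.

```python
from pathlib import Path


def plain_to_records(source_dir):
    """Build submission records from a directory of plaintext files.

    Assumptions (not taken from the official script): the file name is
    used as the document ID, and each line is one sentence whose tokens
    are separated by whitespace.
    """
    records = []
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue
        tokens = []
        for line in path.read_text(encoding="utf-8").splitlines():
            # Concatenate the tokens of all sentences into one summary.
            tokens.extend(line.split())
        records.append({"id": path.name, "summary": tokens})
    return records
```

For actual submissions, the provided script should be preferred, since it defines how document IDs are recovered.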
Download the helper tools and run the validator as follows:
$ python validate_outputs.py /path/to/your/submission/file
Please fix the errors if prompted.
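Before running the official validator, a quick local check along the following lines may catch basic format problems. This is only a sketch based on the field descriptions above, not a reimplementation of validate_outputs.py. The function name check_submission is hypothetical, and the duplicate-ID check is an assumption rather than a stated requirement.

```python
import json


def check_submission(path):
    """Return a list of problems found in a submission file (empty if none)."""
    with open(path, encoding="utf-8") as f:
        try:
            records = json.load(f)
        except json.JSONDecodeError as err:
            return [f"not valid JSON: {err}"]

    if not isinstance(records, list):
        return ["top-level value must be a list of records"]

    problems = []
    seen_ids = set()
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not a JSON object")
            continue
        doc_id = record.get("id")
        if not isinstance(doc_id, str):
            problems.append(f"record {i}: 'id' must be a string")
        elif doc_id in seen_ids:
            # Assumption: each document should appear only once.
            problems.append(f"record {i}: duplicate id {doc_id!r}")
        else:
            seen_ids.add(doc_id)
        summary = record.get("summary")
        if not isinstance(summary, list) or not all(
            isinstance(token, str) for token in summary
        ):
            problems.append(f"record {i}: 'summary' must be a list of strings")
    return problems
```

An empty result from this check does not guarantee that the official validator will accept the file.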
Please follow this submission form.
All deadlines are 23:59:59 anywhere on earth (UTC-12).
[1] Sam Wiseman, Stuart Shieber, and Alexander Rush. Challenges in Data-to-Document Generation. EMNLP 2017.
[2] Ratish Puduppully, Li Dong, and Mirella Lapata. Data-to-Text Generation with Content Selection and Planning. AAAI 2019.