We have extended the system submission deadline to 8/29, 23:59:59 (UTC-12, "anywhere on earth"). The system description deadline remains the same.
We would also like to clarify frequently asked questions regarding the submissions:
Resource constraints are clarified below. You may still make a submission that does not follow these constraints, but you must notify us at submission time.
The WNGT 2019 DGT shared task on "Document-Level Generation and Translation” considers generating textual documents from either structured data or documents in another language or both. We plan for this to be both a task that pushes forward document-level generation technology and also a way to compare and contrast methods for generating from different types of inputs.
There will be three types of systems that we will compare in the shared task:

NLG: systems that generate a document summary from the structured data only.
MT: systems that translate a document written in the other language.
MT+NLG: systems that use both the structured data and a document in the other language.

In addition, the data, listed below, is English-German, and accordingly there will be two target languages: English and German.
This results in a total of 6 tracks.
The dataset that we use is a subset of the RotoWire dataset [1], a dataset of basketball-related articles along with information about the basketball game in structured data. The original RotoWire dataset is an English dataset that has been used for data-to-text natural language generation, and we have had a portion of this dataset manually translated into German. Specifically, the statistics are below:
The RotoWire English-German dataset (v1.5) comes with two formats where both are split into identical train / development / test sets:
original: JSON format. Contains all the statistics from the original dataset and the German summaries.
plaintxt: Parallel texts between English and German. Files are separated according to the ID of documents. Each file consists of one sentence pair per line.

The original format additionally stores the following new fields compared to the original RotoWire dataset:

id: ID of a document.
summary_*: The tokenized English and German summaries, no sentence boundaries.
sentence_end_index_*: List of indices pointing to the ends of sentences for the English and German summaries.

* can either be en or de.
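To make these fields concrete, here is a minimal sketch that loads one split of the original format and reconstructs the German sentences of a document. The file name train.json is an assumption, and the sketch assumes each sentence-end index points at the last token of a sentence; adjust if your copy of the data differs.

import json

# Assumption: the training split of the "original" (JSON) format is stored as train.json.
with open("train.json", encoding="utf-8") as f:
    records = json.load(f)

doc = records[0]
print(doc["id"])                     # document ID

tokens = doc["summary_de"]           # tokenized German summary, no sentence boundaries
ends = doc["sentence_end_index_de"]  # indices marking the end of each sentence

# Rebuild sentences from the flat token list using the sentence-end indices
# (assumed here to be inclusive, i.e. pointing at the last token of each sentence).
start = 0
for end in ends:
    print(" ".join(tokens[start:end + 1]))
    start = end + 1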
NEW: All texts are tokenized. We provide participants with the tokenizer that was used to process the dataset; this is useful when incorporating the external resources described below, as it ensures that tokenization is consistent across datasets.
Notably, the English-German training dataset is small (much smaller than the full English dataset listed below), reflecting the resource constraints that we will encounter when trying to apply these systems to new languages. Because of this, you are further allowed to use the resources below.
MT+NLG Resources (usable in all tracks):
NLG Resources (usable in the NLG track and MT+NLG track):
MT Resources (usable in the MT and MT+NLG track):
Monolingual Resources (usable in all tracks):
If there are any additional resources you would like to see added, please contact the organizers by the “resource addition cutoff date” listed below.
There will be a baseline system trained with OpenNMT, as described in [1].
Systems will be evaluated on the test split of the RotoWire English-German dataset using standard automatic measures: at least BLEU for the MT track, ROUGE and BLEU for the NLG and MT+NLG tracks, and content-oriented metrics (Content Selection, Relation Generation, and Content Ordering [1]) for the (monolingual) NLG track. In addition, we hope, but do not guarantee, that some degree of human evaluation will be performed on the results. Suggestions for evaluation methods are also very welcome.
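The official evaluation scripts will be provided by the organizers. As a rough sanity check during development, a document-level BLEU score can be approximated with sacrebleu; note that the tokenization and settings used for the official scores may differ.

import sacrebleu

# Hypotheses and references: one detokenized document per list entry, aligned by position.
# This only approximates the official scoring setup.
hypotheses = ["Die Chicago Bulls gewannen am Samstag ...", "..."]
references = ["Die Chicago Bulls besiegten am Samstag ...", "..."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")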
Helper tools can be downloaded here.
1. Prepare the format
For all the tracks, we ask the participants to save the generated results in a similar format to the original dataset. Specifically, a submission file should be a single JSON file which contains a list of records with the following fields:
id: ID of each document.
summary: Word-tokenized generated summary.

For example, a valid submission file would look like the one below:
[
{"id": "02_24_16-Cavaliers-Hornets-TheEasternConference-leadingClevelandCavaliers",
"summary": ["Die", "in", "der", ...]},
{"id": "01_01_16-Knicks-Bulls-TheChicagoBulls(19",
"summary": ["Die", "Chicago", "Bulls", ...]},
{"id": "11_07_16-Pelicans-Warriors-AnthonyDaviscontinuestobe",
"summary": ["Anthony", "Davis", "ist", ...]},
{"id": "04_01_16-Cavaliers-Hawks-Inwhatwasahistoric",
"summary": ["In", "einer", "historischen", ...]},
{"id": "01_07_17-Thunder-Nuggets-RussellWestbrookrecordedyetanother",
"summary": ["Russell", "Westbrook", "verzeichnete", ...]},
...
]
Note that indentation, as in the example above, is not required in the submission file.
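As an illustration only, the following sketch writes such a file from a dictionary mapping document IDs to generated token lists; the variable names are placeholders and not part of the required format.

import json

# Placeholder: map from document ID to the word-tokenized generated summary.
generated = {
    "01_01_16-Knicks-Bulls-TheChicagoBulls(19": ["Die", "Chicago", "Bulls"],
    # one entry per test document
}

records = [{"id": doc_id, "summary": tokens} for doc_id, tokens in generated.items()]

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)  # indentation is not required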
For the MT track, we provide a script that converts sentence-by-sentence plaintext outputs into the specified format. Download the helper tools and run the script as follows:
$ python plain2json.py --source-dir /path/to/translations --target-json output.json
where each file in the /path/to/translations directory should contain one target-language sentence per line.
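For reference, such a directory could be prepared from per-document sentence lists along the following lines; the variable names are placeholders, and the exact file-naming convention expected by plain2json.py is not specified here, so check the script's documentation.

import os

# Placeholder: map from document ID to its translated sentences, in order.
translations = {
    "01_01_16-Knicks-Bulls-TheChicagoBulls(19": [
        "Die Chicago Bulls besiegten am Freitag die New York Knicks .",
    ],
}

os.makedirs("translations", exist_ok=True)
for doc_id, sentences in translations.items():
    # One file per document, named by its ID; one target-language sentence per line.
    with open(os.path.join("translations", doc_id), "w", encoding="utf-8") as f:
        f.write("\n".join(sentences) + "\n")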
2. Validate the submission file
Download the helper tools and run the validator as follows:
$ python validate_outputs.py /path/to/your/submission/file
Please fix any errors that are reported.
3. Submit
Please follow the instructions here to proceed with your submission. You will also need to submit a system description; more information is available on the call for papers page.
The following people are responsible for organizing the task.
Please feel free to contact us at any time at wngt2019-organizers@googlegroups.com.
[1] Sam Wiseman, Stuart Shieber and Alexander Rush. Challenges in Data-to-Document Generation. EMNLP 2017.